Artistic style transfer aims to transfer a learned artistic style onto an arbitrary content image, generating artistic stylized images. Existing generative adversarial network-based methods fail to generate highly realistic stylized images and always introduce obvious artifacts and disharmonious patterns. Recently, large-scale pre-trained diffusion models have opened up a new way to generate highly realistic artistic stylized images. However, diffusion model-based methods generally fail to preserve the content structure of input content images well, introducing undesired content structure and style patterns. To address these problems, we propose a novel pre-trained diffusion-based artistic style transfer method, called LSAST, which can generate highly realistic artistic stylized images while preserving the content structure of input content images well, without introducing obvious artifacts or disharmonious style patterns. Specifically, we introduce a Step-aware and Layer-aware Prompt Space, a set of learnable prompts that learn style information from a collection of artworks and dynamically adjust the input image's content structure and style pattern. To train our prompt space, we propose a novel inversion method, called Step-aware and Layer-aware Prompt Inversion, which allows the prompt space to learn the style information of the artworks collection. In addition, we inject a pre-trained conditional branch of ControlNet into LSAST, which further improves our framework's ability to maintain content structure. Extensive experiments demonstrate that our proposed method generates more realistic artistic stylized images than state-of-the-art artistic style transfer methods.
https://arxiv.org/abs/2404.11474
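The abstract above does not specify an implementation, but the step-aware and layer-aware prompt space can be pictured as a small bank of learnable embeddings indexed by denoising-step bucket and U-Net layer. The sketch below is a minimal illustration under assumed shapes and bucket counts; it is not the authors' code, and the injection point into the diffusion model is left abstract.

```python
# Minimal sketch (assumptions: bucket count, token count, embedding dim,
# and how the tokens are consumed downstream are all illustrative).
import torch
import torch.nn as nn

class PromptSpace(nn.Module):
    """Learnable prompts indexed by denoising-step bucket and U-Net layer."""

    def __init__(self, n_step_buckets=4, n_layers=12, n_tokens=8, dim=768):
        super().__init__()
        # One learnable token matrix per (step bucket, layer) pair.
        self.prompts = nn.Parameter(
            torch.randn(n_step_buckets, n_layers, n_tokens, dim) * 0.02
        )
        self.n_step_buckets = n_step_buckets

    def forward(self, t, layer_idx, total_steps=1000):
        # Map the continuous timestep onto a coarse bucket, then return the
        # prompt tokens conditioning this particular layer at this stage.
        bucket = min(int(t / total_steps * self.n_step_buckets),
                     self.n_step_buckets - 1)
        return self.prompts[bucket, layer_idx]  # (n_tokens, dim)

space = PromptSpace()
tokens = space(t=640, layer_idx=3)  # tokens to append to the text conditioning
print(tokens.shape)                 # torch.Size([8, 768])
```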
This research paper proposes a novel methodology for image-to-image style transfer on objects using a single deep convolutional neural network. The proposed approach leverages the You Only Look Once version 8 (YOLOv8) segmentation model and the YOLOv8 backbone network for style transfer. The primary objective is to enhance the visual appeal of objects in images by seamlessly transferring artistic styles while preserving the original object characteristics. The novelty of the approach lies in combining segmentation and style transfer in a single deep convolutional neural network, eliminating the need for multiple stages or models and thus simplifying training and deployment for practical applications. Results are shown on two content images with different style images applied. The paper also demonstrates style transfer applied to multiple objects in the same image.
https://arxiv.org/abs/2404.09461
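For a rough sense of the effect described (object-wise stylization via segmentation masks), one could combine a pre-trained YOLOv8 segmentation checkpoint with any stylized rendering of the same image. This is only a two-stage illustration of the outcome, not the paper's single-network architecture; the checkpoint name, file paths, and blending rule are assumptions.

```python
# Sketch: blend a stylized image into the original only where YOLOv8
# instance masks are active. "content.jpg" is a placeholder path.
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")    # pre-trained segmentation checkpoint
result = model("content.jpg")[0]

def stylize_objects(image, stylized, result):
    """image, stylized: (H, W, 3) float arrays in [0, 1]."""
    out = image.copy()
    if result.masks is None:
        return out
    for mask in result.masks.data.cpu().numpy():  # one mask per instance
        # Note: masks may need resizing to (H, W) before blending.
        m = mask[..., None]
        out = m * stylized + (1 - m) * out
    return out
```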
In this work, we target the task of text-driven style transfer in the context of text-to-image (T2I) diffusion models. The main challenge is consistent structure preservation while enabling effective style transfer effects. Past approaches in this field directly concatenate the content and style prompts for prompt-level style injection, leading to unavoidable structure distortions. In this work, we propose a novel solution to the text-driven style transfer task, namely Adaptive Style Incorporation (ASI), to achieve fine-grained feature-level style incorporation. It consists of Siamese Cross-Attention (SiCA), which decouples the single-track cross-attention into a dual-track structure to obtain separate content and style features, and the Adaptive Content-Style Blending (AdaBlending) module, which couples the content and style information in a structure-consistent manner. Experimentally, our method exhibits much better performance in both structure preservation and stylized effects.
https://arxiv.org/abs/2404.06835
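The dual-track idea can be sketched compactly: run the same queries through two cross-attention passes, one against content tokens and one against style tokens, then blend. The sketch below reduces AdaBlending to a fixed convex combination and omits the separate key/value projections; shapes and the blending rule are illustrative assumptions.

```python
# Minimal sketch of a Siamese (dual-track) cross-attention.
import torch
import torch.nn.functional as F

def cross_attn(q, kv, num_heads=8):
    # Plain multi-head cross-attention; q: (B, N, D), kv: (B, M, D).
    B, N, D = q.shape
    h, d = num_heads, D // num_heads
    Q = q.view(B, N, h, d).transpose(1, 2)
    K = kv.view(B, kv.shape[1], h, d).transpose(1, 2)
    out = F.scaled_dot_product_attention(Q, K, K)  # kv serves as key and value
    return out.transpose(1, 2).reshape(B, N, D)

def siamese_cross_attention(q, content_tokens, style_tokens, alpha=0.5):
    content_feat = cross_attn(q, content_tokens)  # structure-preserving track
    style_feat = cross_attn(q, style_tokens)      # style-carrying track
    # AdaBlending-like step, reduced here to a fixed convex combination.
    return (1 - alpha) * content_feat + alpha * style_feat

q = torch.randn(1, 64, 512)
out = siamese_cross_attention(q, torch.randn(1, 77, 512), torch.randn(1, 77, 512))
print(out.shape)  # torch.Size([1, 64, 512])
```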
Recently, a surge of 3D style transfer methods has been proposed that leverage the scene reconstruction power of a pre-trained neural radiance field (NeRF). To successfully stylize a scene this way, one must first reconstruct a photo-realistic radiance field from collected images of the scene. However, when only sparse input views are available, pre-trained few-shot NeRFs often suffer from high-frequency artifacts, which arise as a by-product of the high-frequency details used to improve reconstruction quality. Is it possible to generate more faithful stylized scenes from sparse inputs by directly optimizing an encoding-based scene representation with the target style? In this paper, we consider the stylization of sparse-view scenes in terms of disentangling content semantics and style textures. We propose a coarse-to-fine sparse-view scene stylization framework, in which a novel hierarchical encoding-based neural representation is designed to generate high-quality stylized scenes directly from implicit scene representations. We also propose a new optimization strategy with content strength annealing to achieve realistic stylization and better content preservation. Extensive experiments demonstrate that our method achieves high-quality stylization of sparse-view scenes and outperforms fine-tuning-based baselines in terms of stylization quality and efficiency.
https://arxiv.org/abs/2404.05236
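Content strength annealing is easy to picture: the content-loss weight starts high so early optimization anchors structure, then decays so later steps emphasize style. The linear schedule below is an assumed shape, not the paper's exact formula.

```python
# Sketch of content strength annealing (linear decay is an assumption).
def content_weight(step, total_steps, w_start=10.0, w_end=1.0):
    """Anneal the content-loss weight from w_start down to w_end."""
    frac = min(step / max(total_steps, 1), 1.0)
    return w_start + (w_end - w_start) * frac

def stylization_loss(style_loss, content_loss, step, total_steps):
    return style_loss + content_weight(step, total_steps) * content_loss

for step in (0, 5000, 10000):
    print(step, content_weight(step, 10000))  # 10.0 -> 5.5 -> 1.0
```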
With the rapid development of XR, 3D generation and editing are becoming more and more important, and stylization is an important tool for 3D appearance editing. It can achieve consistent 3D artistic stylization given a single reference style image and is thus a user-friendly editing approach. However, recent NeRF-based 3D stylization methods face efficiency issues that affect the actual user experience, and their implicit nature limits their ability to transfer geometric pattern styles. Additionally, the ability for artists to exert flexible control over stylized scenes is highly desirable, fostering an environment conducive to creative exploration. In this paper, we introduce StylizedGS, a 3D neural style transfer framework with adaptable control over perceptual factors based on the 3D Gaussian Splatting (3DGS) representation. 3DGS brings the benefit of high efficiency. Before stylization, we propose a GS filter to eliminate floaters in the reconstruction that would otherwise degrade the stylization effects. A nearest-neighbor-based style loss is then introduced to achieve stylization by fine-tuning the geometry and color parameters of the 3DGS, while a depth preservation loss with other regularizations is proposed to prevent tampering with the geometric content. Moreover, facilitated by specially designed losses, StylizedGS enables users to control color, stylized scale, and regions during stylization, offering customized capabilities. Our method attains high-quality stylization results characterized by faithful brushstrokes and geometric consistency with flexible controls. Extensive experiments across various scenes and styles demonstrate the effectiveness and efficiency of our method in terms of both stylization quality and inference FPS.
https://arxiv.org/abs/2404.05220
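A nearest-neighbor style loss of the kind mentioned above pulls each rendered feature toward its most similar style feature, rather than matching global statistics. The sketch below uses cosine similarity over pre-extracted feature vectors; the feature extractor and normalization details are assumptions.

```python
# Sketch of a nearest-neighbor feature-matching style loss.
import torch
import torch.nn.functional as F

def nn_style_loss(render_feats, style_feats):
    """render_feats: (N, D) features of the rendered view;
    style_feats: (M, D) features of the style image."""
    r = F.normalize(render_feats, dim=-1)
    s = F.normalize(style_feats, dim=-1)
    sim = r @ s.t()                 # (N, M) cosine similarities
    nearest = sim.argmax(dim=1)     # index of each feature's nearest style match
    matched = style_feats[nearest]  # (N, D)
    # Minimize cosine distance to the matched style features.
    return (1 - F.cosine_similarity(render_feats, matched, dim=-1)).mean()

loss = nn_style_loss(torch.randn(1024, 256), torch.randn(4096, 256))
print(loss.item())
```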
We propose a novel approach to improve the reproducibility of neuroimaging results by converting statistic maps across different functional MRI pipelines. We make the assumption that pipelines can be considered a style component of the data and propose to use generative models, in particular Diffusion Models (DMs), to convert data between pipelines. We design a new DM-based unsupervised multi-domain image-to-image transition framework and constrain the generation of 3D fMRI statistic maps using the latent space of an auxiliary classifier that distinguishes statistic maps from different pipelines. We extend traditional sampling techniques used in DMs to improve transition performance. Our experiments demonstrate that our proposed methods are successful: pipelines can indeed be transferred, providing an important source of data augmentation for future medical studies.
https://arxiv.org/abs/2404.03703
Style transfer is a promising approach to close the sim-to-real gap in medical endoscopy. Rendering realistic endoscopic videos by traversing pre-operative scans (such as MRI or CT) can generate realistic simulations as well as ground truth camera poses and depth maps. Although image-to-image (I2I) translation models such as CycleGAN perform well, they are unsuitable for video-to-video synthesis due to the lack of temporal consistency, resulting in artifacts between frames. We propose MeshBrush, a neural mesh stylization method to synthesize temporally consistent videos with differentiable rendering. MeshBrush uses the underlying geometry of patient imaging data while leveraging existing I2I methods. With learned per-vertex textures, the stylized mesh guarantees consistency while producing high-fidelity outputs. We demonstrate that mesh stylization is a promising approach for creating realistic simulations for downstream tasks such as training and preoperative planning. Although our method is tested and designed for ureteroscopy, its components are transferable to general endoscopic and laparoscopic procedures.
https://arxiv.org/abs/2404.02999
Foundation models have emerged as pivotal tools, tackling many complex tasks through pre-training on vast datasets and subsequent fine-tuning for specific applications. The Segment Anything Model is one of the first and most well-known foundation models for computer vision segmentation tasks. This work presents a multi-faceted red-teaming analysis that tests the Segment Anything Model against challenging tasks: (1) We analyze the impact of style transfer on segmentation masks, demonstrating that applying adverse weather conditions and raindrops to dashboard images of city roads significantly distorts generated masks. (2) We focus on assessing whether the model can be used for attacks on privacy, such as recognizing celebrities' faces, and show that the model possesses some undesired knowledge in this task. (3) Finally, we check how robust the model is to adversarial attacks on segmentation masks under text prompts. We not only show the effectiveness of popular white-box attacks and resistance to black-box attacks but also introduce a novel approach - Focused Iterative Gradient Attack (FIGA) that combines white-box approaches to construct an efficient attack resulting in a smaller number of modified pixels. All of our testing methods and analyses indicate a need for enhanced safety measures in foundation models for image segmentation.
https://arxiv.org/abs/2404.02067
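The abstract names FIGA only at a high level; a plausible reading is an iterative gradient attack whose updates are restricted to the pixels with the strongest gradients, keeping the number of modified pixels small. The sketch below follows that reading with an assumed top-k rule and step size, and is not the exact FIGA formulation.

```python
# Sketch: iterative gradient attack focused on the top-k gradient pixels.
import torch

def focused_attack(model, x, loss_fn, target, steps=10, step_size=0.01, k=500):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), target)
        grad, = torch.autograd.grad(loss, x_adv)
        # Keep only the k pixels where the gradient magnitude is largest.
        thresh = grad.abs().flatten().topk(k).values.min()
        mask = (grad.abs() >= thresh).float()
        x_adv = (x_adv + step_size * grad.sign() * mask).clamp(0, 1).detach()
    return x_adv
```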
Text detoxification is a textual style transfer (TST) task in which a text is paraphrased from a toxic surface form, e.g. featuring rude words, into the neutral register. Recently, text detoxification methods have found applications in various tasks, such as detoxification of Large Language Models (LLMs) (Leong et al., 2023; He et al., 2024; Tang et al., 2023) and combating toxic speech in social networks (Deng et al., 2023; Mun et al., 2023; Agarwal et al., 2023). All these applications are extremely important for ensuring safe communication in the modern digital world. However, previous approaches for collecting parallel text detoxification corpora -- ParaDetox (Logacheva et al., 2022) and APPADIA (Atwell et al., 2022) -- were explored only in a monolingual setup. In this work, we aim to extend the ParaDetox pipeline to multiple languages, presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language. We then experiment with different text detoxification models -- from unsupervised baselines to LLMs and models fine-tuned on the presented parallel corpora -- showing the great benefit of a parallel corpus for obtaining state-of-the-art text detoxification models for any language.
https://arxiv.org/abs/2404.02037
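With a parallel corpus in hand, detoxification reduces to supervised sequence-to-sequence paraphrasing. The sketch below shows inference with a hypothetical fine-tuned checkpoint; the checkpoint name is a placeholder, not a model released by the paper.

```python
# Sketch: seq2seq detoxification inference (checkpoint name is a placeholder).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "your-org/detox-seq2seq"  # hypothetical fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

toxic = "this is a stupid idea"
inputs = tokenizer(toxic, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```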
Our paper addresses the complex task of transferring a hairstyle from a reference image to an input photo for virtual hair try-on. This task is challenging due to the need to adapt to various photo poses, the sensitivity of hairstyles, and the lack of objective metrics. Current state-of-the-art hairstyle transfer methods use an optimization process for different parts of the approach, making them inexcusably slow. At the same time, faster encoder-based models are of very low quality because they either operate in StyleGAN's W+ space or use other low-dimensional image generators. Additionally, both approaches struggle with hairstyle transfer when the source pose differs strongly from the target pose, because they either don't consider the pose at all or handle it inefficiently. In our paper, we present the HairFast model, which uniquely solves these problems and achieves high resolution, near real-time performance, and superior reconstruction compared to optimization-based methods. Our solution includes a new architecture operating in the FS latent space of StyleGAN, an enhanced inpainting approach, improved encoders for better alignment and color transfer, and a new encoder for post-processing. The effectiveness of our approach is demonstrated on realism metrics after random hairstyle transfer and on reconstruction when the original hairstyle is transferred back. In the most difficult scenario of transferring both the shape and color of a hairstyle from different images, our method runs in less than a second on an Nvidia V100. Our code is available at this https URL.
https://arxiv.org/abs/2404.01094
In recent years, there has been an explosion of 2D vision models for numerous tasks such as semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets. At the same time, there has been renewed interest in 3D scene representations such as neural radiance fields from multi-view images. However, the availability of 3D or multiview data is still substantially limited compared to 2D image datasets, making extending 2D vision models to 3D data highly desirable but also very challenging. Indeed, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task and often requires per-scene optimization. In this paper, we ask the question of whether any 2D vision model can be lifted to make 3D consistent predictions. We answer this question in the affirmative; our new Lift3D method trains to predict unseen views on feature spaces generated by a few visual models (i.e. DINO and CLIP), but then generalizes to novel vision operators and tasks, such as style transfer, super-resolution, open vocabulary segmentation and image colorization; for some of these tasks, there is no comparable previous 3D method. In many cases, we even outperform state-of-the-art methods specialized for the task in question. Moreover, Lift3D is a zero-shot method, in the sense that it requires no task-specific training, nor scene-specific optimization.
https://arxiv.org/abs/2403.18922
Image style transfer aims to imbue digital imagery with the distinctive attributes of style targets, such as colors, brushstrokes, and shapes, while concurrently preserving the semantic integrity of the content. Despite the advancements in arbitrary style transfer methods, a prevalent challenge remains the delicate equilibrium between content semantics and style attributes. Recent developments in large-scale text-to-image diffusion models have heralded unprecedented synthesis capabilities, albeit at the expense of relying on extensive and often imprecise textual descriptions to delineate artistic styles. Addressing these limitations, this paper introduces DiffStyler, a novel approach that facilitates efficient and precise arbitrary image style transfer. At the core of DiffStyler lies the use of a LoRA, based on a text-to-image Stable Diffusion model, to encapsulate the essence of style targets. This approach, coupled with strategic cross-LoRA feature and attention injection, guides the style transfer process. The foundation of our methodology is rooted in the observation that LoRA maintains the spatial feature consistency of the UNet, a discovery that further inspired the development of a mask-wise style transfer technique. This technique employs masks extracted through a pre-trained FastSAM model, using mask prompts to facilitate feature fusion during the denoising process, thereby enabling localized style transfer that preserves the original image's unaffected regions. Moreover, our approach accommodates multiple style targets through the use of corresponding masks. Through extensive experimentation, we demonstrate that DiffStyler surpasses previous methods in achieving a more harmonious balance between content preservation and style integration.
https://arxiv.org/abs/2403.18461
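The mask-wise mechanism can be pictured as a per-step blend of two latents during denoising: style lands only inside the mask, and everything outside is kept from the original. Where exactly the blend happens in the sampler is an assumption in this sketch.

```python
# Sketch: mask-wise latent fusion inside a denoising step.
import torch

def masked_latent_fusion(latent_styled, latent_orig, mask):
    """mask: (1, 1, H, W) in [0, 1]; 1 = stylize, 0 = keep original."""
    return mask * latent_styled + (1 - mask) * latent_orig

latent_styled = torch.randn(1, 4, 64, 64)
latent_orig = torch.randn(1, 4, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0  # stylize only the central region
print(masked_latent_fusion(latent_styled, latent_orig, mask).shape)
```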
In Virtual Reality (VR), adversarial attack remains a significant security threat. Most deep learning-based methods for physical and digital adversarial attacks focus on enhancing attack performance by crafting adversarial examples that contain large printable distortions that are easy for human observers to identify. However, attackers rarely impose limitations on the naturalness and comfort of the appearance of the generated attack image, resulting in a noticeable and unnatural attack. To address this challenge, we propose a framework to incorporate style transfer to craft adversarial inputs of natural styles that exhibit minimal detectability and maximum natural appearance, while maintaining superior attack capabilities.
https://arxiv.org/abs/2403.14778
Image stylization involves manipulating the visual appearance and texture (style) of an image while preserving its underlying objects, structures, and concepts (content). The separation of style and content is essential for manipulating the image's style independently from its content, ensuring a harmonious and visually pleasing result. Achieving this separation requires a deep understanding of both the visual and semantic characteristics of images, often necessitating the training of specialized models or employing heavy optimization. In this paper, we introduce B-LoRA, a method that leverages LoRA (Low-Rank Adaptation) to implicitly separate the style and content components of a single image, facilitating various image stylization tasks. By analyzing the architecture of SDXL combined with LoRA, we find that jointly learning the LoRA weights of two specific blocks (referred to as B-LoRAs) achieves style-content separation that cannot be achieved by training each B-LoRA independently. Consolidating the training into only two blocks and separating style and content allows for significantly improving style manipulation and overcoming overfitting issues often associated with model fine-tuning. Once trained, the two B-LoRAs can be used as independent components to allow various image stylization tasks, including image style transfer, text-based image stylization, consistent style generation, and style-content mixing.
https://arxiv.org/abs/2403.14572
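Mechanically, the idea amounts to attaching LoRA adapters only to two chosen blocks of the network and training those jointly while the base weights stay frozen. The sketch below illustrates this in plain PyTorch; the block names are placeholders (the paper identifies the two relevant SDXL blocks empirically), and the adapter placement is an assumption.

```python
# Sketch: attach LoRA to linear layers inside two named blocks only.
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a zero (identity) update

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

def attach_b_loras(unet, block_names=("content_block", "style_block")):
    for p in unet.parameters():
        p.requires_grad = False  # freeze the base model
    targets = [
        (module, child_name, child)
        for name, module in unet.named_modules()
        if any(b in name for b in block_names)
        for child_name, child in module.named_children()
        if isinstance(child, nn.Linear)
    ]
    for module, child_name, child in targets:
        # Fresh LoRA weights are trainable by default.
        setattr(module, child_name, LoRALinear(child))
```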
Video-to-video editing involves editing a source video along with additional control (such as text prompts, subjects, or styles) to generate a new video that aligns with the source video and the provided control. Traditional methods have been constrained to certain editing types, limiting their ability to meet the wide range of user demands. In this paper, we introduce AnyV2V, a novel training-free framework designed to simplify video editing into two primary steps: (1) employing an off-the-shelf image editing model (e.g., InstructPix2Pix, InstantID) to modify the first frame, (2) utilizing an existing image-to-video generation model (e.g., I2VGen-XL) for DDIM inversion and feature injection. In the first stage, AnyV2V can plug in any existing image editing tools to support an extensive array of video editing tasks. Beyond traditional prompt-based editing methods, AnyV2V can also support novel video editing tasks, including reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. In the second stage, AnyV2V can plug in any existing image-to-video models to perform DDIM inversion and intermediate feature injection to maintain the appearance and motion consistency with the source video. On prompt-based editing, we show that AnyV2V can outperform the previous best approach by 35% on prompt alignment and 25% on human preference. On the three novel tasks, we show that AnyV2V also achieves a high success rate. We believe AnyV2V will continue to thrive due to its ability to seamlessly integrate the fast-evolving image editing methods. Such compatibility can help AnyV2V increase its versatility to cater to diverse user demands.
https://arxiv.org/abs/2403.14468
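The DDIM inversion step AnyV2V leans on is the deterministic DDIM update run in reverse, mapping a clean latent back toward noise. The sketch below assumes `alphas` holds the cumulative noise schedule and `model` predicts noise; both are placeholders.

```python
# Sketch: deterministic DDIM inversion (clean latent -> noised latent).
import torch

@torch.no_grad()
def ddim_invert(model, x0, alphas, cond):
    x = x0
    for t in range(len(alphas) - 1):
        a_t, a_next = alphas[t], alphas[t + 1]
        eps = model(x, t, cond)  # predicted noise at this step
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x  # a latent that regenerates x0 under deterministic DDIM sampling
```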
We present novel approaches involving generative adversarial networks and diffusion models in order to synthesize high-quality live and spoof fingerprint images while preserving features such as uniqueness and diversity. We generate live fingerprints from noise with a variety of methods, and we use image translation techniques to translate live fingerprint images to spoof. To generate different types of spoof images based on limited training data, we incorporate style transfer techniques through a cycle autoencoder equipped with a Wasserstein metric and gradient penalty (CycleWGAN-GP) in order to avoid mode collapse and instability. We find that when the spoof training data includes distinct spoof characteristics, it leads to improved live-to-spoof translation. We assess the diversity and realism of the generated live fingerprint images mainly through the Fréchet Inception Distance (FID) and the False Acceptance Rate (FAR). Our best diffusion model achieved an FID of 15.78. The comparable WGAN-GP model achieved a slightly higher FID while performing better in the uniqueness assessment due to a slightly lower FAR when matched against the training data, indicating better creativity. Moreover, we give example images showing that a DDPM model can clearly generate realistic fingerprint images.
https://arxiv.org/abs/2403.13916
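The gradient penalty in the CycleWGAN-GP component is the standard WGAN-GP term: penalize the critic's gradient norm on random interpolates between real and fake samples so it stays close to 1. A minimal sketch:

```python
# Sketch of the standard WGAN-GP gradient penalty.
import torch

def gradient_penalty(critic, real, fake, gp_weight=10.0):
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads, = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores), create_graph=True,
    )
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return gp_weight * ((grad_norm - 1) ** 2).mean()
```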
We present a novel method for constructing Variational Autoencoder (VAE). Instead of using pixel-by-pixel loss, we enforce deep feature consistency between the input and the output of a VAE, which ensures the VAE's output to preserve the spatial correlation characteristics of the input, thus leading the output to have a more natural visual appearance and better perceptual quality. Based on recent deep learning works such as style transfer, we employ a pre-trained deep convolutional neural network (CNN) and use its hidden features to define a feature perceptual loss for VAE training. Evaluated on the CelebA face dataset, we show that our model produces better results than other methods in the literature. We also show that our method can produce latent vectors that can capture the semantic information of face expressions and can be used to achieve state-of-the-art performance in facial attribute prediction.
https://arxiv.org/abs/1610.00291
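The feature perceptual loss compares hidden activations of a frozen pre-trained CNN between the VAE's input and its reconstruction, instead of raw pixels. The sketch below uses torchvision's VGG19; the specific layer indices are an assumption (the paper selects particular hidden layers).

```python
# Sketch of a feature perceptual loss with a frozen VGG19.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

vgg = vgg19(weights=VGG19_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad = False

LAYERS = {3, 8, 17}  # ReLU layer indices used for the loss (assumed)

def feature_perceptual_loss(x, x_recon):
    """x, x_recon: (B, 3, H, W) images normalized for VGG input."""
    loss, fx, fr = 0.0, x, x_recon
    for i, layer in enumerate(vgg):
        fx, fr = layer(fx), layer(fr)
        if i in LAYERS:
            loss = loss + F.mse_loss(fr, fx)
        if i == max(LAYERS):
            break
    return loss

x = torch.randn(2, 3, 224, 224)
print(feature_perceptual_loss(x, x).item())  # 0.0 for identical inputs
```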
Despite the success of diffusion-based generative models at producing high-quality images from arbitrary text prompts, prior works directly generate the entire image and cannot provide object-wise manipulation capability. To support wider real applications like professional graphic design and digital artistry, images are frequently created and manipulated in multiple layers to offer greater flexibility and control. Therefore, in this paper, we propose a layer-collaborative diffusion model, named LayerDiff, specifically designed for text-guided, multi-layered, composable image synthesis. The composable image consists of a background layer, a set of foreground layers, and an associated mask layer for each foreground element. To enable this, LayerDiff introduces a layer-based generation paradigm incorporating multiple layer-collaborative attention modules to capture inter-layer patterns. Specifically, an inter-layer attention module is designed to encourage information exchange and learning between layers, while a text-guided intra-layer attention module incorporates layer-specific prompts to direct content generation for each layer. A layer-specific prompt-enhanced module better captures detailed textual cues from the global prompt. Additionally, a self-mask guidance sampling strategy further unleashes the model's ability to generate multi-layered images. We also present a pipeline that integrates existing perceptual and generative models to produce a large dataset of high-quality, text-prompted, multi-layered images. Extensive experiments demonstrate that our LayerDiff model can generate high-quality multi-layered images with performance comparable to conventional whole-image generation methods. Moreover, LayerDiff enables a broader range of controllable generative applications, including layer-specific image editing and style transfer.
https://arxiv.org/abs/2403.11929
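The layered output format itself is straightforward to consume: a background, a stack of foregrounds, and one alpha mask per foreground, composited front to back. A minimal compositing sketch (shapes assumed):

```python
# Sketch: composite a background, foreground layers, and their masks.
import torch

def composite(background, foregrounds, masks):
    """background, foregrounds[i]: (3, H, W); masks[i]: (1, H, W) in [0, 1]."""
    out = background
    for fg, m in zip(foregrounds, masks):
        out = m * fg + (1 - m) * out  # paint each layer over the canvas
    return out

bg = torch.rand(3, 64, 64)
fg = [torch.rand(3, 64, 64)]
mask = [torch.zeros(1, 64, 64)]
print(composite(bg, fg, mask).shape)  # torch.Size([3, 64, 64])
```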
Previous work has shown that well-crafted adversarial perturbations can threaten the security of video recognition systems. Attackers can invade such models with a low query budget when the perturbations are semantic-invariant, as in StyleFool. Despite the query efficiency, the naturalness of fine-detail areas still requires improvement, since StyleFool applies style transfer to all pixels in each frame. To close the gap, we propose LocalStyleFool, an improved black-box video adversarial attack that superimposes regional style-transfer-based perturbations on videos. Benefiting from the popularity and scalable usability of the Segment Anything Model (SAM), we first extract different regions according to semantic information and then track them through the video stream to maintain temporal consistency. We then add style-transfer-based perturbations to several regions selected according to an associative criterion combining transfer-based gradient information and regional area. Fine adjustment of the perturbations follows to make the stylized videos adversarial. We demonstrate through a human-assessed survey that LocalStyleFool can improve both intra-frame and inter-frame naturalness while maintaining a competitive fooling rate and query efficiency. Successful experiments on a high-resolution dataset also showcase that the scrupulous segmentation of SAM helps improve the scalability of adversarial attacks under high-resolution data.
https://arxiv.org/abs/2403.11656
Automation in medical imaging is quite challenging due to the unavailability of annotated datasets and the scarcity of domain experts. In recent years, deep learning techniques have solved some complex medical imaging tasks such as disease classification, localization of important objects, and segmentation. However, most of these tasks require a large amount of annotated data for successful implementation. To mitigate the shortage of data, different generative models have been proposed for data augmentation, which can boost classification performance. To this end, synthetic medical image generation models are developed to enlarge the dataset; unpaired image-to-image translation models shift images from the source domain to the target domain. In breast malignancy identification, FNAC is one of the low-cost, low-invasive modalities normally used by medical practitioners, but the availability of public datasets in this domain is very poor, whereas automating cytology image analysis requires a large amount of annotated data. Therefore, synthetic cytology images are generated by translating publicly available breast histopathology samples. In this study, we explore traditional image-to-image transfer models such as CycleGAN and Neural Style Transfer. We further observe that the generated cytology images are quite similar to real breast cytology samples, as measured by FID and KID scores.
https://arxiv.org/abs/2403.10885
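One of the two translation approaches explored above, Neural Style Transfer, rests on matching Gram matrices of CNN feature maps between the generated image and the style image. A minimal sketch of that loss:

```python
# Sketch of the classic Gram-matrix style loss behind Neural Style Transfer.
import torch

def gram_matrix(feat):
    """feat: (B, C, H, W) -> (B, C, C) channel correlation matrix."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(feat_generated, feat_style):
    return torch.mean((gram_matrix(feat_generated) - gram_matrix(feat_style)) ** 2)

print(style_loss(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)).item())
```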