This paper addresses the problem of converting real photographs into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Although this problem could be tackled by a wide range of image-to-image translation models, a notable issue with all of these methods is that the original image content can easily be erased or corrupted by the transfer of ink-wash style elements. To mitigate this issue, we propose incorporating saliency detection into the unpaired image-to-image translation framework to regularize the content of the generated paintings. The saliency map is used for content regularization both explicitly and implicitly: (i) we propose a saliency IOU (SIOU) loss to explicitly enforce saliency consistency before and after stylization; (ii) we propose saliency adaptive normalization (SANorm), which implicitly enhances the content integrity of the generated paintings by injecting saliency information into the generator network to guide painting generation. In addition, we propose a saliency-attended discriminator network that harnesses the saliency mask to focus generative adversarial attention on salient image regions, which contributes to finer ink-wash stylization of the salient objects in an image. Qualitative and quantitative experiments consistently demonstrate the superiority of our model over related state-of-the-art methods for Chinese ink-wash painting style transfer.
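As a rough illustration of the saliency-consistency idea above, here is a minimal, framework-free sketch of a soft IoU loss between saliency maps; the function names and the soft min/max formulation are assumptions for illustration, not the paper's exact SIOU definition.

```python
# Hypothetical sketch of a saliency-IoU (SIOU) consistency loss of the kind
# the abstract describes. Saliency maps are 2D lists of values in [0, 1];
# a "soft" IoU treats each pixel's saliency as a fractional membership.

def soft_iou(sal_a, sal_b):
    """Soft intersection-over-union between two saliency maps."""
    inter = sum(min(a, b) for row_a, row_b in zip(sal_a, sal_b)
                for a, b in zip(row_a, row_b))
    union = sum(max(a, b) for row_a, row_b in zip(sal_a, sal_b)
                for a, b in zip(row_a, row_b))
    return inter / union if union > 0 else 1.0

def siou_loss(sal_content, sal_stylized):
    """Penalize saliency drift between the input photo and the stylized output."""
    return 1.0 - soft_iou(sal_content, sal_stylized)
```

In training, `sal_content` and `sal_stylized` would come from a saliency detector applied to the input photo and the generated painting, and this term would be added to the adversarial objective.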
https://arxiv.org/abs/2404.15743
Creating artistic 3D scenes can be time-consuming and requires specialized knowledge. To address this, recent works such as ARF use a radiance field-based approach with style constraints to generate 3D scenes that resemble a style image provided by the user. However, these methods lack fine-grained control over the resulting scenes. In this paper, we introduce Controllable Artistic Radiance Fields (CoARF), a novel algorithm for controllable 3D scene stylization. CoARF enables style transfer for specified objects, compositional 3D style transfer, and semantic-aware style transfer. We achieve controllability using segmentation masks with different label-dependent loss functions. We also propose a semantic-aware nearest-neighbor matching algorithm to improve style transfer quality. Our extensive experiments demonstrate that CoARF provides user-specified control over style transfer and superior style transfer quality with more precise feature matching.
https://arxiv.org/abs/2404.14967
Previous studies on music style transfer have mainly focused on one-to-one style conversion, which is relatively limited. When considering conversion between multiple styles, previous methods required designing multiple modes to disentangle the complex style of the music, resulting in large computational costs and slow audio generation. Existing music style transfer methods also generate spectrograms with artifacts, leading to significant noise in the generated audio. To address these issues, this study proposes a music style transfer framework based on diffusion models (DM) and uses spectrogram-based methods to achieve multi-to-multi music style transfer. The GuideDiff method is used to restore spectrograms to high-fidelity audio, accelerating audio generation and reducing noise in the generated audio. Experimental results show that, compared to the baseline, our model performs well in multi-mode music style transfer and can generate high-quality audio in real time on consumer-grade GPUs.
https://arxiv.org/abs/2404.14771
This paper presents a novel contribution to the field of regional style transfer. Existing methods often suffer from the drawback of applying style homogeneously across the entire image, leading to stylistic inconsistencies or distorted foreground objects when applied to images with foreground elements such as human figures. To address this limitation, we propose a new approach that leverages a segmentation network to precisely isolate foreground objects within the input image. Style transfer is then applied exclusively to the background region. The isolated foreground objects are subsequently reintegrated into the style-transferred background. To enhance the visual coherence between foreground and background, a color transfer step is applied to the foreground elements prior to their reincorporation. Finally, we utilize feathering techniques to achieve a seamless amalgamation of foreground and background, resulting in a visually unified and aesthetically pleasing final composition. Extensive evaluations demonstrate that our proposed approach yields significantly more natural stylistic transformations compared to conventional methods.
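The final two steps of the pipeline above (feathering and alpha compositing) can be sketched as follows; the box-blur feathering and list-of-lists image representation are simplifying assumptions, not the paper's implementation.

```python
# Sketch: soften a binary foreground mask, then alpha-blend the
# (color-transferred) foreground over the stylized background.

def feather_mask(mask, radius=1):
    """Soften a binary mask by box-averaging over a (2*radius+1)^2 window."""
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [mask[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out

def composite(fg, bg, alpha):
    """Alpha-blend foreground over background: out = a*fg + (1-a)*bg, per pixel."""
    return [[a * f + (1.0 - a) * b for f, b, a in zip(rf, rb, ra)]
            for rf, rb, ra in zip(fg, bg, alpha)]
```

The soft alpha at the mask border is what hides the seam between the untouched foreground and the stylized background.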
https://arxiv.org/abs/2404.13880
Arbitrary style transfer attracts widespread attention in research and has numerous practical applications. Existing methods, which either employ cross-attention to incorporate deep style attributes into content attributes or use adaptive normalization to adjust content features, fail to generate high-quality stylized images. In this paper, we introduce an innovative technique to improve the quality of stylized images. First, we propose Style Consistency Instance Normalization (SCIN), a method to refine the alignment between content and style features. In addition, we develop an Instance-based Contrastive Learning (ICL) approach designed to capture the relationships among various styles, thereby enhancing the quality of the resulting stylized images. Recognizing that VGG networks are adept at extracting classification features but less well suited to capturing style features, we also introduce a Perception Encoder (PE) to capture style features. Extensive experiments demonstrate that, compared with existing state-of-the-art methods, our proposed method generates high-quality stylized images and effectively avoids artifacts.
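For context, SCIN belongs to the adaptive-normalization family whose baseline, AdaIN, re-normalizes each content feature channel with the style's channel statistics. The sketch below is plain AdaIN, shown only to ground the discussion; it is not the paper's SCIN.

```python
import math

# Baseline AdaIN: per channel, shift/scale content features so their
# mean and std match the style channel's mean and std.

def adain(content, style, eps=1e-5):
    """content, style: lists of channels, each a flat list of activations."""
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mu = sum(c_chan) / len(c_chan)
        c_var = sum((v - c_mu) ** 2 for v in c_chan) / len(c_chan)
        s_mu = sum(s_chan) / len(s_chan)
        s_var = sum((v - s_mu) ** 2 for v in s_chan) / len(s_chan)
        c_sigma = math.sqrt(c_var + eps)   # eps guards against zero variance
        s_sigma = math.sqrt(s_var + eps)
        out.append([s_sigma * (v - c_mu) / c_sigma + s_mu for v in c_chan])
    return out
```

SCIN, per the abstract, refines this kind of statistic alignment so the injected style statistics stay consistent with the content features.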
https://arxiv.org/abs/2404.13584
Transforming two-dimensional (2D) images into three-dimensional (3D) volumes is a well-known yet challenging problem for the computer vision community. In the medical domain, a few previous studies attempted to convert two or more input radiographs into computed tomography (CT) volumes. Building on their efforts, we introduce a diffusion model-based technique that can rotate the anatomical content of any input radiograph in 3D space, potentially enabling the visualization of the entire anatomical content of the radiograph from any viewpoint in 3D. Similar to previous studies, we used CT volumes to create Digitally Reconstructed Radiographs (DRRs) as the training data for our model. However, we addressed two significant limitations encountered in previous studies: 1. We utilized conditional diffusion models with classifier-free guidance instead of Generative Adversarial Networks (GANs) to achieve higher mode coverage and improved output image quality, with the only trade-off being slower inference time, which is often less critical in medical applications; and 2. We demonstrated that the unreliable output of style transfer deep learning (DL) models such as CycleGAN, used to transfer the style of actual radiographs onto DRRs, can be replaced with a simple yet effective training transformation that randomly alters the pixel-intensity histograms of the input and ground-truth imaging data during training. This transformation makes the diffusion model agnostic to any variation in the pixel-intensity distribution of the input data, enabling reliable training of a DL model on input DRRs and application of the exact same model to conventional radiographs (or DRRs) during inference.
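The training transformation described in point 2 might look roughly like the following; the particular remap (a random gamma curve plus a random linear rescale) is an illustrative choice, not necessarily the paper's exact histogram transformation.

```python
import random

# Illustrative training-time augmentation: each sample gets one random
# monotonic intensity remap, so the model never learns to rely on a
# fixed pixel-intensity distribution (DRR vs. real radiograph).

def random_intensity_remap(pixels, rng=random):
    """Apply one random monotonic intensity remap to pixels in [0, 1]."""
    gamma = rng.uniform(0.5, 2.0)   # random contrast curve
    lo = rng.uniform(0.0, 0.2)      # random output range (lo < hi by construction)
    hi = rng.uniform(0.8, 1.0)
    return [lo + (hi - lo) * (p ** gamma) for p in pixels]
```

Because the remap is monotonic, anatomical structure (relative intensity ordering) is preserved while the histogram itself changes every draw.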
https://arxiv.org/abs/2404.13000
Artistic style transfer aims to transfer a learned artistic style onto an arbitrary content image, generating artistic stylized images. Existing generative adversarial network-based methods fail to generate highly realistic stylized images and often introduce obvious artifacts and disharmonious patterns. Recently, large-scale pre-trained diffusion models have opened up a new way of generating highly realistic artistic stylized images. However, diffusion model-based methods generally fail to preserve the content structure of input content images well, introducing undesired content structures and style patterns. To address these problems, we propose a novel pre-trained diffusion-based artistic style transfer method, called LSAST, which can generate highly realistic artistic stylized images while preserving the content structure of input content images, without introducing obvious artifacts or disharmonious style patterns. Specifically, we introduce a Step-aware and Layer-aware Prompt Space, a set of learnable prompts, which can learn style information from a collection of artworks and dynamically adjust the input image's content structure and style pattern. To train our prompt space, we propose a novel inversion method, called Step-aware and Layer-aware Prompt Inversion, which allows the prompt space to learn the style information of the artwork collection. In addition, we inject a pre-trained conditional branch of ControlNet into LSAST, which further improves our framework's ability to maintain content structure. Extensive experiments demonstrate that our proposed method generates more highly realistic artistic stylized images than state-of-the-art artistic style transfer methods.
https://arxiv.org/abs/2404.11474
This research paper proposes a novel methodology for image-to-image style transfer on objects utilizing a single deep convolutional neural network. The proposed approach leverages the You Only Look Once version 8 (YOLOv8) segmentation model and the backbone neural network of YOLOv8 for style transfer. The primary objective is to enhance the visual appeal of objects in images by seamlessly transferring artistic styles while preserving the original object characteristics. The novelty of the proposed approach lies in combining segmentation and style transfer in a single deep convolutional neural network. This approach eliminates the need for multiple stages or models, resulting in simpler training and deployment for practical applications. Results are shown on two content images with different style images applied. The paper also demonstrates the ability to apply style transfer to multiple objects in the same image.
https://arxiv.org/abs/2404.09461
In this work, we target the task of text-driven style transfer in the context of text-to-image (T2I) diffusion models. The main challenge is consistent structure preservation while enabling effective style transfer. Past approaches in this field directly concatenate the content and style prompts for prompt-level style injection, leading to unavoidable structure distortions. In this work, we propose a novel solution to the text-driven style transfer task, namely Adaptive Style Incorporation (ASI), to achieve fine-grained, feature-level style incorporation. It consists of Siamese Cross-Attention (SiCA), which decouples the single-track cross-attention into a dual-track structure to obtain separate content and style features, and the Adaptive Content-Style Blending (AdaBlending) module, which couples the content and style information in a structure-consistent manner. Experimentally, our method exhibits much better performance in both structure preservation and stylization effects.
https://arxiv.org/abs/2404.06835
Recently, a surge of 3D style transfer methods has been proposed that leverage the scene reconstruction power of a pre-trained neural radiance field (NeRF). To successfully stylize a scene this way, one must first reconstruct a photo-realistic radiance field from collected images of the scene. However, when only sparse input views are available, pre-trained few-shot NeRFs often suffer from high-frequency artifacts, which are generated as a by-product of high-frequency details for improving reconstruction quality. Is it possible to generate more faithful stylized scenes from sparse inputs by directly optimizing encoding-based scene representation with target style? In this paper, we consider the stylization of sparse-view scenes in terms of disentangling content semantics and style textures. We propose a coarse-to-fine sparse-view scene stylization framework, where a novel hierarchical encoding-based neural representation is designed to generate high-quality stylized scenes directly from implicit scene representations. We also propose a new optimization strategy with content strength annealing to achieve realistic stylization and better content preservation. Extensive experiments demonstrate that our method can achieve high-quality stylization of sparse-view scenes and outperforms fine-tuning-based baselines in terms of stylization quality and efficiency.
https://arxiv.org/abs/2404.05236
With the rapid development of XR, 3D generation and editing are becoming more and more important; among these, stylization is an important tool for 3D appearance editing. It can achieve consistent 3D artistic stylization given a single reference style image and is thus a user-friendly editing approach. However, recent NeRF-based 3D stylization methods face efficiency issues that affect the actual user experience, and their implicit nature limits their ability to transfer geometric pattern styles. Additionally, the ability for artists to exert flexible control over stylized scenes is highly desirable, fostering an environment conducive to creative exploration. In this paper, we introduce StylizedGS, a 3D neural style transfer framework with adaptable control over perceptual factors, based on the 3D Gaussian Splatting (3DGS) representation. 3DGS brings the benefit of high efficiency. We propose a GS filter to eliminate floaters in the reconstruction, which would otherwise degrade the stylization results. A nearest neighbor-based style loss is then introduced to achieve stylization by fine-tuning the geometry and color parameters of the 3DGS, while a depth preservation loss together with other regularizations is proposed to prevent tampering with the geometric content. Moreover, facilitated by specially designed losses, StylizedGS enables users to control the color, stylization scale, and regions during stylization, providing customization capabilities. Our method attains high-quality stylization results characterized by faithful brushstrokes and geometric consistency with flexible controls. Extensive experiments across various scenes and styles demonstrate the effectiveness and efficiency of our method in terms of both stylization quality and inference FPS.
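The nearest neighbor-based style loss mentioned above can be sketched generically as follows; the cosine distance and the mean-of-minimum-distances reduction are common choices for such losses, not necessarily StylizedGS's exact formulation.

```python
import math

# Generic nearest-neighbor feature-matching style loss: each rendered
# feature is matched to its closest style feature, and the mean of those
# minimum distances is minimized when fine-tuning the scene parameters.

def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def nn_style_loss(rendered_feats, style_feats):
    """Mean distance from each rendered feature to its nearest style feature."""
    return sum(min(cosine_distance(f, s) for s in style_feats)
               for f in rendered_feats) / len(rendered_feats)
```

In a real pipeline the features would be VGG-like activations of the rendered view and the style image, and gradients of this loss would flow back into the 3DGS geometry and color parameters.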
https://arxiv.org/abs/2404.05220
We propose a novel approach to improve the reproducibility of neuroimaging results by converting statistic maps across different functional MRI pipelines. We make the assumption that pipelines can be considered a style component of the data and propose to use generative models, in particular Diffusion Models (DM), to convert data between pipelines. We design a new DM-based unsupervised multi-domain image-to-image translation framework and constrain the generation of 3D fMRI statistic maps using the latent space of an auxiliary classifier that distinguishes statistic maps from different pipelines. We extend traditional sampling techniques used in DMs to improve translation performance. Our experiments demonstrate that our proposed methods are successful: pipelines can indeed be transferred, providing an important source of data augmentation for future medical studies.
https://arxiv.org/abs/2404.03703
Style transfer is a promising approach to close the sim-to-real gap in medical endoscopy. Rendering realistic endoscopic videos by traversing pre-operative scans (such as MRI or CT) can generate realistic simulations as well as ground truth camera poses and depth maps. Although image-to-image (I2I) translation models such as CycleGAN perform well, they are unsuitable for video-to-video synthesis due to the lack of temporal consistency, resulting in artifacts between frames. We propose MeshBrush, a neural mesh stylization method to synthesize temporally consistent videos with differentiable rendering. MeshBrush uses the underlying geometry of patient imaging data while leveraging existing I2I methods. With learned per-vertex textures, the stylized mesh guarantees consistency while producing high-fidelity outputs. We demonstrate that mesh stylization is a promising approach for creating realistic simulations for downstream tasks such as training and preoperative planning. Although our method is tested and designed for ureteroscopy, its components are transferable to general endoscopic and laparoscopic procedures.
https://arxiv.org/abs/2404.02999
Foundation models have emerged as pivotal tools, tackling many complex tasks through pre-training on vast datasets and subsequent fine-tuning for specific applications. The Segment Anything Model is one of the first and most well-known foundation models for computer vision segmentation tasks. This work presents a multi-faceted red-teaming analysis that tests the Segment Anything Model against challenging tasks: (1) We analyze the impact of style transfer on segmentation masks, demonstrating that applying adverse weather conditions and raindrops to dashboard images of city roads significantly distorts generated masks. (2) We focus on assessing whether the model can be used for attacks on privacy, such as recognizing celebrities' faces, and show that the model possesses some undesired knowledge in this task. (3) Finally, we check how robust the model is to adversarial attacks on segmentation masks under text prompts. We not only show the effectiveness of popular white-box attacks and resistance to black-box attacks but also introduce a novel approach - Focused Iterative Gradient Attack (FIGA) that combines white-box approaches to construct an efficient attack resulting in a smaller number of modified pixels. All of our testing methods and analyses indicate a need for enhanced safety measures in foundation models for image segmentation.
https://arxiv.org/abs/2404.02067
Text detoxification is a textual style transfer (TST) task in which a text is paraphrased from a toxic surface form, e.g. featuring rude words, to the neutral register. Recently, text detoxification methods have found applications in various tasks such as detoxification of Large Language Models (LLMs) (Leong et al., 2023; He et al., 2024; Tang et al., 2023) and combating toxic speech in social networks (Deng et al., 2023; Mun et al., 2023; Agarwal et al., 2023). All these applications are extremely important for ensuring safe communication in the modern digital world. However, previous approaches to collecting parallel text detoxification corpora -- ParaDetox (Logacheva et al., 2022) and APPADIA (Atwell et al., 2022) -- were explored only in a monolingual setup. In this work, we aim to extend the ParaDetox pipeline to multiple languages, presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language. We then experiment with different text detoxification models -- from unsupervised baselines to LLMs and models fine-tuned on the presented parallel corpora -- showing the great benefit of a parallel corpus for obtaining state-of-the-art text detoxification models for any language.
https://arxiv.org/abs/2404.02037
Our paper addresses the complex task of transferring a hairstyle from a reference image to an input photo for virtual hair try-on. This task is challenging due to the need to adapt to various photo poses, the sensitivity of hairstyles, and the lack of objective metrics. The current state of the art hairstyle transfer methods use an optimization process for different parts of the approach, making them inexcusably slow. At the same time, faster encoder-based models are of very low quality because they either operate in StyleGAN's W+ space or use other low-dimensional image generators. Additionally, both approaches have a problem with hairstyle transfer when the source pose is very different from the target pose, because they either don't consider the pose at all or deal with it inefficiently. In our paper, we present the HairFast model, which uniquely solves these problems and achieves high resolution, near real-time performance, and superior reconstruction compared to optimization problem-based methods. Our solution includes a new architecture operating in the FS latent space of StyleGAN, an enhanced inpainting approach, and improved encoders for better alignment, color transfer, and a new encoder for post-processing. The effectiveness of our approach is demonstrated on realism metrics after random hairstyle transfer and reconstruction when the original hairstyle is transferred. In the most difficult scenario of transferring both shape and color of a hairstyle from different images, our method performs in less than a second on the Nvidia V100. Our code is available at this https URL.
https://arxiv.org/abs/2404.01094
In recent years, there has been an explosion of 2D vision models for numerous tasks such as semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets. At the same time, there has been renewed interest in 3D scene representations such as neural radiance fields from multi-view images. However, the availability of 3D or multiview data is still substantially limited compared to 2D image datasets, making extending 2D vision models to 3D data highly desirable but also very challenging. Indeed, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task and often requires per-scene optimization. In this paper, we ask the question of whether any 2D vision model can be lifted to make 3D consistent predictions. We answer this question in the affirmative; our new Lift3D method trains to predict unseen views on feature spaces generated by a few visual models (i.e. DINO and CLIP), but then generalizes to novel vision operators and tasks, such as style transfer, super-resolution, open vocabulary segmentation and image colorization; for some of these tasks, there is no comparable previous 3D method. In many cases, we even outperform state-of-the-art methods specialized for the task in question. Moreover, Lift3D is a zero-shot method, in the sense that it requires no task-specific training, nor scene-specific optimization.
https://arxiv.org/abs/2403.18922
Image style transfer aims to imbue digital imagery with the distinctive attributes of style targets, such as colors, brushstrokes, and shapes, whilst concurrently preserving the semantic integrity of the content. Despite advancements in arbitrary style transfer methods, a prevalent challenge remains the delicate equilibrium between content semantics and style attributes. Recent developments in large-scale text-to-image diffusion models have heralded unprecedented synthesis capabilities, albeit at the expense of relying on extensive and often imprecise textual descriptions to delineate artistic styles. Addressing these limitations, this paper introduces DiffStyler, a novel approach that facilitates efficient and precise arbitrary image style transfer. At the core of DiffStyler lies the use of a LoRA, based on a text-to-image Stable Diffusion model, to encapsulate the essence of style targets. This approach, coupled with strategic cross-LoRA feature and attention injection, guides the style transfer process. The foundation of our methodology is rooted in the observation that LoRA maintains the spatial feature consistency of the UNet, a discovery that further inspired the development of a mask-wise style transfer technique. This technique employs masks extracted through a pre-trained FastSAM model, utilizing mask prompts to facilitate feature fusion during the denoising process, thereby enabling localized style transfer that preserves the original image's unaffected regions. Moreover, our approach accommodates multiple style targets through the use of corresponding masks. Through extensive experimentation, we demonstrate that DiffStyler surpasses previous methods in achieving a more harmonious balance between content preservation and style integration.
https://arxiv.org/abs/2403.18461
In Virtual Reality (VR), adversarial attack remains a significant security threat. Most deep learning-based methods for physical and digital adversarial attacks focus on enhancing attack performance by crafting adversarial examples that contain large printable distortions that are easy for human observers to identify. However, attackers rarely impose limitations on the naturalness and comfort of the appearance of the generated attack image, resulting in a noticeable and unnatural attack. To address this challenge, we propose a framework to incorporate style transfer to craft adversarial inputs of natural styles that exhibit minimal detectability and maximum natural appearance, while maintaining superior attack capabilities.
https://arxiv.org/abs/2403.14778
Image stylization involves manipulating the visual appearance and texture (style) of an image while preserving its underlying objects, structures, and concepts (content). The separation of style and content is essential for manipulating the image's style independently from its content, ensuring a harmonious and visually pleasing result. Achieving this separation requires a deep understanding of both the visual and semantic characteristics of images, often necessitating the training of specialized models or employing heavy optimization. In this paper, we introduce B-LoRA, a method that leverages LoRA (Low-Rank Adaptation) to implicitly separate the style and content components of a single image, facilitating various image stylization tasks. By analyzing the architecture of SDXL combined with LoRA, we find that jointly learning the LoRA weights of two specific blocks (referred to as B-LoRAs) achieves style-content separation that cannot be achieved by training each B-LoRA independently. Consolidating the training into only two blocks and separating style and content allows for significantly improving style manipulation and overcoming overfitting issues often associated with model fine-tuning. Once trained, the two B-LoRAs can be used as independent components to allow various image stylization tasks, including image style transfer, text-based image stylization, consistent style generation, and style-content mixing.
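For readers unfamiliar with the mechanism B-LoRA builds on: LoRA adapts a frozen weight matrix W with a trainable low-rank product, so the effective weight is W' = W + AB. The sketch below shows that generic update on plain lists-of-lists; it is not the paper's block-selection scheme.

```python
# Generic LoRA forward pass: the frozen path x @ W plus a low-rank
# adapter path x @ (A @ B), where A is (d x r) and B is (r x d) with
# small rank r, so only A and B are trained.

def matmul(A, B):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(x, W, A, B):
    """y = x @ (W + A @ B): frozen weights plus the low-rank update."""
    delta = matmul(A, B)
    W_adapted = [[w + d for w, d in zip(rw, rd)]
                 for rw, rd in zip(W, delta)]
    return matmul(x, W_adapted)
```

B-LoRA's observation, per the abstract, is that training such adapters jointly on only two specific SDXL blocks makes one block capture content and the other style.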
https://arxiv.org/abs/2403.14572