Image style transfer aims to imbue digital imagery with the distinctive attributes of style targets, such as colors, brushstrokes, and shapes, while preserving the semantic integrity of the content. Despite the advancements in arbitrary style transfer methods, a prevalent challenge remains the delicate equilibrium between content semantics and style attributes. Recent developments in large-scale text-to-image diffusion models have heralded unprecedented synthesis capabilities, albeit at the expense of relying on extensive and often imprecise textual descriptions to delineate artistic styles. Addressing these limitations, this paper introduces DiffStyler, a novel approach that facilitates efficient and precise arbitrary image style transfer. At the core of DiffStyler lies the use of a LoRA, trained on a text-to-image Stable Diffusion model, to encapsulate the essence of the style target. This approach, coupled with strategic cross-LoRA feature and attention injection, guides the style transfer process. The foundation of our methodology is rooted in the observation that LoRA maintains the spatial feature consistency of the UNet, a discovery that further inspired the development of a mask-wise style transfer technique. This technique employs masks extracted through a pre-trained FastSAM model, using mask prompts to facilitate feature fusion during the denoising process, thereby enabling localized style transfer that preserves the original image's unaffected regions. Moreover, our approach accommodates multiple style targets through the use of corresponding masks. Through extensive experimentation, we demonstrate that DiffStyler surpasses previous methods in achieving a more harmonious balance between content preservation and style integration.
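As a rough illustration of the mask-wise transfer described above, the denoising-time fusion can be thought of as blending stylized and original UNet features only inside the masked regions. The sketch below shows that blending step in PyTorch under simplifying assumptions (random tensors stand in for real UNet features, and nearest-neighbour resizing of the FastSAM mask is an assumption, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def mask_wise_fuse(orig_feat: torch.Tensor,
                   styled_feat: torch.Tensor,
                   mask: torch.Tensor) -> torch.Tensor:
    """Blend features so that only mask==1 regions receive the stylized
    features, while the rest keeps the original content features.

    orig_feat, styled_feat: (B, C, H, W) features from the same denoising
    step; mask: (B, 1, h, w) binary mask (e.g. extracted with FastSAM).
    """
    # Resize the mask to the feature resolution of the current UNet level.
    mask = F.interpolate(mask.float(), size=orig_feat.shape[-2:], mode="nearest")
    return mask * styled_feat + (1.0 - mask) * orig_feat

# Toy usage with random tensors standing in for real UNet features.
f_content = torch.randn(1, 320, 64, 64)
f_styled = torch.randn(1, 320, 64, 64)
m = (torch.rand(1, 1, 512, 512) > 0.5)
print(mask_wise_fuse(f_content, f_styled, m).shape)  # torch.Size([1, 320, 64, 64])
```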
https://arxiv.org/abs/2403.18461
In Virtual Reality (VR), adversarial attacks remain a significant security threat. Most deep learning-based methods for physical and digital adversarial attacks focus on enhancing attack performance by crafting adversarial examples that contain large printable distortions that are easy for human observers to identify. However, attackers rarely impose limitations on the naturalness and comfort of the appearance of the generated attack image, resulting in noticeable and unnatural attacks. To address this challenge, we propose a framework that incorporates style transfer to craft adversarial inputs of natural styles that exhibit minimal detectability and maximum natural appearance, while maintaining superior attack capabilities.
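To make the "natural style plus attack capability" trade-off concrete, one common formulation combines a Gram-matrix style loss with a targeted classification loss in a single objective. The sketch below is a generic illustration of such a joint loss, not the paper's exact formulation; the feature lists, logits, and the 1e3 weight are placeholder assumptions:

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Channel-wise Gram matrix of a (B, C, H, W) feature map."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_plus_attack_loss(adv_feats, style_feats, logits, target_class,
                           style_weight=1e3):
    """Joint objective: keep the perturbed image close to the style target
    (Gram-matrix style loss) while pushing the victim classifier toward the
    attack target (cross-entropy on the target class)."""
    style_loss = sum(F.mse_loss(gram_matrix(a), gram_matrix(s))
                     for a, s in zip(adv_feats, style_feats))
    attack_loss = F.cross_entropy(logits, target_class)
    return style_weight * style_loss + attack_loss
```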
https://arxiv.org/abs/2403.14778
Image stylization involves manipulating the visual appearance and texture (style) of an image while preserving its underlying objects, structures, and concepts (content). The separation of style and content is essential for manipulating the image's style independently from its content, ensuring a harmonious and visually pleasing result. Achieving this separation requires a deep understanding of both the visual and semantic characteristics of images, often necessitating the training of specialized models or the use of heavy optimization. In this paper, we introduce B-LoRA, a method that leverages LoRA (Low-Rank Adaptation) to implicitly separate the style and content components of a single image, facilitating various image stylization tasks. By analyzing the architecture of SDXL combined with LoRA, we find that jointly learning the LoRA weights of two specific blocks (referred to as B-LoRAs) achieves style-content separation that cannot be achieved by training each B-LoRA independently. Consolidating the training into only two blocks and separating style and content significantly improves style manipulation and overcomes the overfitting issues often associated with model fine-tuning. Once trained, the two B-LoRAs can be used as independent components to enable various image stylization tasks, including image style transfer, text-based image stylization, consistent style generation, and style-content mixing.
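The central mechanism, training LoRA weights only under a small set of chosen blocks, can be sketched without any diffusers-specific tooling. The snippet below attaches simple low-rank adapters to the linear layers under user-supplied module prefixes; the LoRALinear class, the rank, and the example block paths are illustrative assumptions rather than the actual B-LoRA code or SDXL module names:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update: y = Wx + s*B(Ax)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

def attach_lora_to_blocks(model: nn.Module, block_prefixes: tuple) -> None:
    """Wrap every nn.Linear that lives under one of the chosen block prefixes."""
    for name, module in list(model.named_modules()):
        if not name.startswith(block_prefixes):
            continue
        for child_name, child in list(module.named_children()):
            if isinstance(child, nn.Linear):
                setattr(module, child_name, LoRALinear(child))

# Hypothetical prefixes; the real SDXL UNet exposes its own module paths.
# attach_lora_to_blocks(unet, ("up_blocks.0.attentions.0",
#                              "up_blocks.0.attentions.1"))
```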
https://arxiv.org/abs/2403.14572
Video-to-video editing involves editing a source video along with additional control (such as text prompts, subjects, or styles) to generate a new video that aligns with the source video and the provided control. Traditional methods have been constrained to certain editing types, limiting their ability to meet the wide range of user demands. In this paper, we introduce AnyV2V, a novel training-free framework designed to simplify video editing into two primary steps: (1) employing an off-the-shelf image editing model (e.g., InstructPix2Pix, InstantID) to modify the first frame, and (2) utilizing an existing image-to-video generation model (e.g., I2VGen-XL) for DDIM inversion and feature injection. In the first stage, AnyV2V can plug in any existing image editing tool to support an extensive array of video editing tasks. Beyond traditional prompt-based editing methods, AnyV2V can also support novel video editing tasks, including reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. In the second stage, AnyV2V can plug in any existing image-to-video model to perform DDIM inversion and intermediate feature injection to maintain appearance and motion consistency with the source video. On prompt-based editing, we show that AnyV2V outperforms the previous best approach by 35\% on prompt alignment and 25\% on human preference. On the three novel tasks, we show that AnyV2V also achieves a high success rate. We believe AnyV2V will continue to thrive due to its ability to seamlessly integrate fast-evolving image editing methods, a compatibility that helps AnyV2V increase its versatility to cater to diverse user demands.
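The second stage hinges on DDIM inversion, i.e., running the deterministic DDIM update in reverse to recover a noise latent that reconstructs the source frames. A minimal, model-agnostic sketch of that inversion loop is shown below; eps_model is a placeholder for whatever frozen noise predictor is used, and the uniform step schedule is a simplification:

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, num_steps=50):
    """Deterministic DDIM inversion: map a clean latent x0 back toward noise
    by stepping the DDIM update from low to high noise levels.

    eps_model(x, t) -> predicted noise; alphas_cumprod: 1-D tensor of
    cumulative alpha-bar values indexed by timestep.
    """
    timesteps = torch.linspace(0, len(alphas_cumprod) - 1, num_steps).long()
    x = x0
    for i in range(len(timesteps) - 1):
        t_cur, t_next = timesteps[i], timesteps[i + 1]
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = eps_model(x, t_cur)
        # Recover the current estimate of x0, then move to the next
        # (noisier) timestep along the deterministic DDIM trajectory.
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x
```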
https://arxiv.org/abs/2403.14468
We present novel approaches involving generative adversarial networks and diffusion models in order to synthesize high-quality live and spoof fingerprint images while preserving features such as uniqueness and diversity. We generate live fingerprints from noise with a variety of methods, and we use image translation techniques to translate live fingerprint images to spoof. To generate different types of spoof images based on limited training data, we incorporate style transfer techniques through a cycle autoencoder equipped with a Wasserstein metric along with a gradient penalty (CycleWGAN-GP) in order to avoid mode collapse and instability. We find that when the spoof training data includes distinct spoof characteristics, it leads to improved live-to-spoof translation. We assess the diversity and realism of the generated live fingerprint images mainly through the Fréchet Inception Distance (FID) and the False Acceptance Rate (FAR). Our best diffusion model achieved an FID of 15.78. The comparable WGAN-GP model achieved a slightly higher FID but performed better in the uniqueness assessment due to a slightly lower FAR when matched against the training data, indicating better creativity. Moreover, we give example images showing that a DDPM model can clearly generate realistic fingerprint images.
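The CycleWGAN-GP translation relies on the standard WGAN-GP gradient penalty to avoid the mode collapse and instability mentioned above. A self-contained sketch of that penalty term (not the paper's full training loop) is shown here; the critic is assumed to map an image batch to one scalar score per sample:

```python
import torch

def gradient_penalty(critic, real, fake, device="cpu"):
    """WGAN-GP penalty: push the critic's gradient norm toward 1 on random
    interpolations between real and fake samples."""
    b = real.size(0)
    eps = torch.rand(b, 1, 1, 1, device=device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(outputs=scores, inputs=interp,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()
```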
https://arxiv.org/abs/2403.13916
We present a novel method for constructing a Variational Autoencoder (VAE). Instead of using a pixel-by-pixel loss, we enforce deep feature consistency between the input and the output of the VAE, which ensures that the VAE's output preserves the spatial correlation characteristics of the input, thus giving the output a more natural visual appearance and better perceptual quality. Based on recent deep learning works such as style transfer, we employ a pre-trained deep convolutional neural network (CNN) and use its hidden features to define a feature perceptual loss for VAE training. Evaluated on the CelebA face dataset, we show that our model produces better results than other methods in the literature. We also show that our method can produce latent vectors that capture the semantic information of face expressions and can be used to achieve state-of-the-art performance in facial attribute prediction.
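A feature perceptual loss of the kind described above compares input and reconstruction in the hidden-feature space of a frozen, pre-trained CNN instead of pixel space. Below is a minimal sketch using torchvision's VGG-19; the chosen layer indices are illustrative assumptions, not necessarily the layers used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class FeaturePerceptualLoss(nn.Module):
    """Sum of MSE losses between VGG-19 hidden features of x and y."""
    def __init__(self, layer_ids=(3, 8, 15)):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_ids = set(layer_ids)
        self.last_layer = max(layer_ids)

    def forward(self, x, y):
        loss, h_x, h_y = 0.0, x, y
        for i, layer in enumerate(self.vgg):
            h_x, h_y = layer(h_x), layer(h_y)
            if i in self.layer_ids:
                loss = loss + F.mse_loss(h_x, h_y)
            if i >= self.last_layer:
                break
        return loss
```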
https://arxiv.org/abs/1610.00291
Despite the success of diffusion-based generative models in generating high-quality images from arbitrary text prompts, prior works directly generate entire images and cannot provide object-wise manipulation capability. To support wider real applications like professional graphic design and digital artistry, images are frequently created and manipulated in multiple layers to offer greater flexibility and control. Therefore, in this paper, we propose a layer-collaborative diffusion model, named LayerDiff, specifically designed for text-guided, multi-layered, composable image synthesis. The composable image consists of a background layer, a set of foreground layers, and associated mask layers for each foreground element. To enable this, LayerDiff introduces a layer-based generation paradigm incorporating multiple layer-collaborative attention modules to capture inter-layer patterns. Specifically, an inter-layer attention module is designed to encourage information exchange and learning between layers, while a text-guided intra-layer attention module incorporates layer-specific prompts to direct the specific content generation for each layer. A layer-specific prompt-enhanced module better captures detailed textual cues from the global prompt. Additionally, a self-mask guidance sampling strategy further unleashes the model's ability to generate multi-layered images. We also present a pipeline that integrates existing perceptual and generative models to produce a large dataset of high-quality, text-prompted, multi-layered images. Extensive experiments demonstrate that our LayerDiff model can generate high-quality multi-layered images with performance comparable to conventional whole-image generation methods. Moreover, LayerDiff enables a broader range of controllable generative applications, including layer-specific image editing and style transfer.
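The multi-layered representation itself is simple: a background plus foreground layers, each with a mask, composited back to front. The snippet below is a generic alpha-compositing sketch to make that representation concrete; it is not LayerDiff's generation code, and the shapes are assumptions:

```python
import torch

def composite_layers(background, foregrounds, masks):
    """Composite foreground layers over a background, back to front.

    background: (3, H, W); foregrounds: list of (3, H, W) layers;
    masks: list of (1, H, W) soft or binary masks, one per foreground.
    """
    canvas = background.clone()
    for fg, m in zip(foregrounds, masks):
        canvas = m * fg + (1.0 - m) * canvas
    return canvas

bg = torch.rand(3, 256, 256)
fgs = [torch.rand(3, 256, 256) for _ in range(2)]
ms = [(torch.rand(1, 256, 256) > 0.7).float() for _ in range(2)]
print(composite_layers(bg, fgs, ms).shape)  # torch.Size([3, 256, 256])
```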
https://arxiv.org/abs/2403.11929
Previous work has shown that well-crafted adversarial perturbations can threaten the security of video recognition systems. Attackers can invade such models with a low query budget when the perturbations are semantically invariant, as in StyleFool. Despite the query efficiency, the naturalness of fine-detail areas still requires improvement, since StyleFool applies style transfer to all pixels in each frame. To close this gap, we propose LocalStyleFool, an improved black-box video adversarial attack that superimposes regional style-transfer-based perturbations on videos. Benefiting from the popularity and scalable usability of the Segment Anything Model (SAM), we first extract different regions according to semantic information and then track them through the video stream to maintain temporal consistency. We then add style-transfer-based perturbations to several regions selected according to a criterion that combines transfer-based gradient information and regional area. Fine adjustment of the perturbations follows to make the stylized videos adversarial. Through a human-assessed survey, we demonstrate that LocalStyleFool improves both intra-frame and inter-frame naturalness while maintaining a competitive fooling rate and query efficiency. Successful experiments on a high-resolution dataset also show that the careful segmentation provided by SAM helps improve the scalability of adversarial attacks on high-resolution data.
https://arxiv.org/abs/2403.11656
Automation in medical imaging is quite challenging due to the unavailability of annotated datasets and the scarcity of domain experts. In recent years, deep learning techniques have solved some complex medical imaging tasks such as disease classification, important object localization, and segmentation. However, most of these tasks require a large amount of annotated data for successful implementation. To mitigate the shortage of data, different generative models have been proposed for data augmentation, which can boost classification performance. To this end, various synthetic medical image generation models have been developed to enlarge the dataset. Unpaired image-to-image translation models shift images from a source domain to a target domain. In the breast malignancy identification domain, FNAC is a low-cost, minimally invasive modality commonly used by medical practitioners, but the availability of public datasets in this domain is very poor, whereas automation of cytology image analysis requires a large amount of annotated data. Therefore, synthetic cytology images are generated by translating breast histopathology samples, which are publicly available. In this study, we explore traditional image-to-image translation models such as CycleGAN and Neural Style Transfer. Further, measured by FID and KID scores, the generated cytology images are observed to be quite similar to real breast cytology samples.
https://arxiv.org/abs/2403.10885
Visual odometry plays a crucial role in endoscopic imaging, yet the scarcity of realistic images with ground-truth poses presents a significant challenge. Domain adaptation therefore offers a promising approach to bridge the pre-operative planning domain with the intra-operative real domain for learning odometry information. However, existing methodologies suffer from inefficient training times. In this work, an efficient neural style transfer framework for endoscopic visual odometry is proposed, which compresses the time from pre-operative planning to the testing phase to less than five minutes. For efficient training, this work focuses on training modules with only a limited number of real images, and we exploit pre-operative prior information to dramatically reduce training duration. Moreover, during the testing phase, we propose a novel Test-Time Adaptation (TTA) method to mitigate the gap in lighting conditions between training and testing datasets. Experimental evaluations conducted on two public endoscope datasets show that our method achieves state-of-the-art accuracy in visual odometry tasks while boasting the fastest training speed. These results demonstrate significant promise for intra-operative surgery applications.
https://arxiv.org/abs/2403.10860
The standard approach to tackling computer vision problems is to train deep convolutional neural network (CNN) models using large-scale image datasets that are representative of the target task. However, in many scenarios, it is often challenging to obtain sufficient image data for the target task. Data augmentation is a way to mitigate this challenge. A common practice is to explicitly transform existing images in desired ways so as to create the volume and variability of training data necessary to achieve good generalization performance. In situations where data for the target domain is not accessible, a viable workaround is to synthesize training data from scratch, i.e., synthetic data augmentation. This paper presents an extensive review of synthetic data augmentation techniques. It covers data synthesis approaches based on realistic 3D graphics modeling, neural style transfer (NST), differential neural rendering, and generative artificial intelligence (AI) techniques such as generative adversarial networks (GANs) and variational autoencoders (VAEs). For each of these classes of methods, we focus on the important data generation and augmentation techniques, the general scope of application and specific use cases, as well as existing limitations and possible workarounds. Additionally, we provide a summary of common synthetic datasets for training computer vision models, highlighting the main features, application domains, and supported tasks. Finally, we discuss the effectiveness of synthetic data augmentation methods. Since this is the first paper to explore synthetic data augmentation methods in great detail, we hope to equip readers with the necessary background information and in-depth knowledge of existing methods and their attendant issues.
https://arxiv.org/abs/2403.10075
Stain normalization algorithms aim to transform the color and intensity characteristics of a source multi-gigapixel histology image to match those of a target image, mitigating inconsistencies in the appearance of the stains used to highlight cellular components in the images. We propose a new approach, StainFuser, which treats this problem as a style transfer task using a novel Conditional Latent Diffusion architecture, eliminating the need for handcrafted color components. With this method, we curate SPI-2M, the largest stain normalization dataset to date, comprising over 2 million histology images processed with neural style transfer for high-quality transformations. Trained on this data, StainFuser outperforms current state-of-the-art GAN-based and handcrafted methods in terms of the quality of normalized images. Additionally, compared to existing approaches, it improves the performance of nuclei instance segmentation and classification models when used as a test-time augmentation method on the challenging CoNIC dataset. Finally, we apply StainFuser to multi-gigapixel Whole Slide Images (WSIs) and demonstrate improved computational efficiency, image quality, and consistency across tiles compared with current methods.
https://arxiv.org/abs/2403.09302
Since the breakthrough of ChatGPT, large language models (LLMs) have garnered significant attention in the research community. With the development of LLMs, the question of text style transfer for conversational models has emerged as a natural extension, where chatbots may possess their own styles or even characters. However, standard evaluation metrics have not yet been established for this new setting. This paper aims to address this issue by proposing the LMStyle Benchmark, a novel evaluation framework for chat-style text style transfer (C-TST) that can measure the quality of style transfer for LLMs in an automated and scalable manner. In addition to conventional style strength metrics, LMStyle Benchmark further considers a novel metric called appropriateness, a high-level measure that accounts for coherence, fluency, and other implicit factors without the aid of reference samples. Our experiments demonstrate that the new evaluation methods introduced by LMStyle Benchmark correlate more highly with human judgments in terms of appropriateness. Based on LMStyle Benchmark, we present a comprehensive list of evaluation results for popular LLMs, including LLaMA, Alpaca, and Vicuna, reflecting their stylistic properties, such as formality and sentiment strength, along with their appropriateness.
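A conventional style-strength metric of the kind mentioned above is usually computed by running a style classifier over the transferred outputs and counting how often the target style is predicted. The snippet below sketches such a metric for sentiment strength with an off-the-shelf Hugging Face sentiment model; the model choice and the simple label-matching rule are illustrative assumptions, not LMStyle Benchmark's implementation:

```python
from transformers import pipeline

def style_strength(outputs, target_label="POSITIVE"):
    """Fraction of generated responses whose predicted label matches the
    target style, i.e., a classifier-based style-strength score."""
    clf = pipeline("text-classification",
                   model="distilbert-base-uncased-finetuned-sst-2-english")
    preds = clf(list(outputs), truncation=True)
    return sum(p["label"] == target_label for p in preds) / len(preds)

print(style_strength(["What a delightful answer!", "This is terrible."]))
```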
https://arxiv.org/abs/2403.08943
Scene stylization extends the work of neural style transfer to three spatial dimensions. A vital challenge in this problem is maintaining the uniformity of the stylized appearance across a multi-view setting. The vast majority of previous works achieve this by optimizing the scene with a specific style image. In contrast, we propose a novel architecture trained on a collection of style images that, at test time, produces high-quality stylized novel views. Our work builds on the framework of 3D Gaussian splatting. For a given scene, we take the pretrained Gaussians and process them using a multi-resolution hash grid and a tiny MLP to obtain the conditionally stylized views. The explicit nature of 3D Gaussians gives us inherent advantages over NeRF-based methods, including geometric consistency and a fast training and rendering regime. This makes our method useful for a wide range of practical use cases, such as augmented and virtual reality applications. Through our experiments, we show that our method achieves state-of-the-art performance with superior visual quality on various indoor and outdoor real-world data.
https://arxiv.org/abs/2403.08498
4D style transfer aims at transferring arbitrary visual styles to synthesized novel views of a dynamic 4D scene with varying viewpoints and times. Existing efforts on 3D style transfer can effectively combine the visual features of style images and neural radiance fields (NeRF) but, limited by the static-scene assumption, fail to handle 4D dynamic scenes. Consequently, we aim to handle the novel and challenging problem of 4D style transfer for the first time, which further requires consistency of the stylized results on dynamic objects. In this paper, we introduce StyleDyRF, a method that represents the 4D feature space by deforming a canonical feature volume and learns a linear style transformation matrix on the feature volume in a data-driven fashion. To obtain the canonical feature volume, the rays at each time step are deformed with the geometric prior of a pre-trained dynamic NeRF to render the feature map under the supervision of pre-trained visual encoders. With the content and style cues in the canonical feature volume and the style image, we can learn the style transformation matrix from their covariance matrices with lightweight neural networks. The learned style transformation matrix reflects a direct matching of feature covariance from the content volume to the given style pattern, analogous to the optimization of the Gram matrix in traditional 2D neural style transfer. The experimental results show that our method not only renders photorealistic 4D style transfer results in a zero-shot manner but also outperforms existing methods in terms of visual quality and consistency.
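The learned linear transformation is described as matching feature covariance from the content volume to the style pattern. Its classic closed-form analogue is the whitening-and-coloring transform; the sketch below illustrates that covariance matching on flattened feature matrices and is not StyleDyRF's network-predicted matrix:

```python
import torch

def linear_style_transform(content_feat, style_feat, eps=1e-5):
    """Whitening-and-coloring transform: give the content features the
    covariance (and mean) of the style features.

    content_feat: (C, N_c) flattened content features;
    style_feat:   (C, N_s) flattened style features.
    """
    fc = content_feat - content_feat.mean(dim=1, keepdim=True)
    mu_s = style_feat.mean(dim=1, keepdim=True)
    fs = style_feat - mu_s

    def cov_power(f, power):
        # Symmetric matrix power of the feature covariance via eigh.
        cov = f @ f.T / (f.shape[1] - 1) + eps * torch.eye(f.shape[0])
        evals, evecs = torch.linalg.eigh(cov)
        return evecs @ torch.diag(evals.clamp_min(eps) ** power) @ evecs.T

    whitened = cov_power(fc, -0.5) @ fc       # decorrelate content features
    colored = cov_power(fs, 0.5) @ whitened   # impose the style covariance
    return colored + mu_s

out = linear_style_transform(torch.randn(64, 1000), torch.randn(64, 800))
print(out.shape)  # torch.Size([64, 1000])
```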
https://arxiv.org/abs/2403.08310
Existing generative adversarial network (GAN) based conditional image generative models typically produce a fixed output for the same conditional input, which is unreasonable for highly subjective tasks such as large-mask image inpainting or style transfer. On the other hand, GAN-based diverse image generative methods require retraining or fine-tuning the network or designing complex noise injection functions, which is computationally expensive and task-specific, or they struggle to generate high-quality results. Given that many deterministic conditional image generative models have been able to produce high-quality yet fixed results, we raise an intriguing question: is it possible for pre-trained deterministic conditional image generative models to generate diverse results without changing network structures or parameters? To answer this question, we re-examine conditional image generation tasks from the perspective of adversarial attack and propose a simple and efficient plug-in method, akin to projected gradient descent (PGD), for diverse and controllable image generation. The key idea is to attack the pre-trained deterministic generative model by adding a micro-perturbation to the input condition. In this way, diverse results can be generated without any adjustment of network structures or fine-tuning of the pre-trained models. In addition, we can also control the generated diverse results by specifying the attack direction according to a reference text or image. Our work opens the door to applying adversarial attack to low-level vision tasks, and experiments on various conditional image generation tasks demonstrate the effectiveness and superiority of the proposed method.
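The key idea, a PGD-style micro-perturbation of the input condition of a frozen generator, can be sketched generically as below. The generator, the objective (e.g., similarity to a reference image or text embedding), and the step sizes are placeholders; this is an illustrative sketch, not the paper's exact algorithm:

```python
import torch

def diversify_by_condition_attack(generator, cond, objective,
                                  steps=10, alpha=1e-3, eps=1e-2):
    """Perturb only the condition of a frozen deterministic generator with
    signed-gradient steps; the network itself is untouched, yet each
    perturbed condition yields a different output."""
    delta = torch.zeros_like(cond, requires_grad=True)
    for _ in range(steps):
        loss = objective(generator(cond + delta))  # scalar to be increased
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # ascent on the objective
            delta.clamp_(-eps, eps)             # keep the perturbation micro
            delta.grad.zero_()
    return generator(cond + delta.detach())
```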
https://arxiv.org/abs/2403.08294
Authorship style transfer aims to rewrite a given text in a specified target style while preserving the original meaning of the source. Existing approaches rely on the availability of a large number of target-style exemplars for model training. However, these approaches overlook cases where only a limited number of target-style examples are available. The development of parameter-efficient transfer learning techniques and policy optimization (PO) approaches suggests that lightweight PO is a feasible approach to low-resource style transfer. In this work, we propose a simple two-step tune-and-optimize technique for low-resource textual style transfer. We apply our technique to authorship transfer as well as a larger-data native-language style task, and in both cases find that it outperforms state-of-the-art baseline models.
https://arxiv.org/abs/2403.08043
We introduce StyleGaussian, a novel 3D style transfer technique that allows instant transfer of any image's style to a 3D scene at 10 frames per second (fps). Leveraging 3D Gaussian Splatting (3DGS), StyleGaussian achieves style transfer without compromising its real-time rendering ability and multi-view consistency. It achieves instant style transfer with three steps: embedding, transfer, and decoding. Initially, 2D VGG scene features are embedded into reconstructed 3D Gaussians. Next, the embedded features are transformed according to a reference style image. Finally, the transformed features are decoded into the stylized RGB. StyleGaussian has two novel designs. The first is an efficient feature rendering strategy that first renders low-dimensional features and then maps them into high-dimensional features while embedding VGG features. It cuts the memory consumption significantly and enables 3DGS to render the high-dimensional memory-intensive features. The second is a K-nearest-neighbor-based 3D CNN. Working as the decoder for the stylized features, it eliminates the 2D CNN operations that compromise strict multi-view consistency. Extensive experiments show that StyleGaussian achieves instant 3D stylization with superior stylization quality while preserving real-time rendering and strict multi-view consistency. Project page: this https URL
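The strict multi-view consistency comes from decoding stylized features directly on the Gaussians with a KNN-based 3D CNN rather than a 2D CNN in image space. The snippet below sketches only the basic building block, k-nearest-neighbour feature aggregation over Gaussian centers with a plain mean instead of learned weights; it is not StyleGaussian's decoder:

```python
import torch

def knn_aggregate(positions, features, k=8):
    """Mix each point's feature with those of its k nearest neighbours
    (including itself), the primitive a KNN-based 3D convolution builds on.

    positions: (N, 3) Gaussian centers; features: (N, C)."""
    dists = torch.cdist(positions, positions)        # (N, N) pairwise distances
    knn_idx = dists.topk(k, largest=False).indices   # (N, k) neighbour indices
    return features[knn_idx].mean(dim=1)             # (N, C) aggregated features

pos = torch.randn(1024, 3)
feat = torch.randn(1024, 32)
print(knn_aggregate(pos, feat).shape)  # torch.Size([1024, 32])
```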
https://arxiv.org/abs/2403.07807
Recently, and under the umbrella of Responsible AI, efforts have been made to develop gender-ambiguous synthetic speech that represents all individuals in the gender spectrum with a single voice. However, research efforts have completely overlooked speaking style, despite differences found among binary and non-binary populations. In this work, we synthesise gender-ambiguous speech by combining the timbre of a male speaker with the manner of speech of a female speaker, using voice morphing and pitch shifting towards the male-female boundary. Subjective evaluations indicate that the ambiguity of the morphed samples that convey the female speech style is higher than that of samples that undergo pure pitch transformations, suggesting that speaking style can be a contributing factor in creating gender-ambiguous speech. To our knowledge, this is the first study that explicitly uses the transfer of speaking style to create gender-ambiguous voices.
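Of the two operations used above, pitch shifting is straightforward to illustrate with standard audio tooling; voice morphing of timbre and speaking style is not shown. In the sketch below, the file name and the three-semitone shift are placeholder assumptions:

```python
import librosa
import soundfile as sf

# Load a speech sample and shift its pitch toward the male-female boundary.
y, sr = librosa.load("speaker.wav", sr=None)
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=3.0)
sf.write("speaker_shifted.wav", y_shifted, sr)
```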
https://arxiv.org/abs/2403.07661
3D-consistent image generation from a single 2D semantic label is an important and challenging research topic in computer graphics and computer vision. Although some related works have made great progress in this field, most existing methods suffer from poor disentanglement of shape and appearance and lack multi-modal control. In this paper, we propose a novel end-to-end 3D-aware image generation and editing model that incorporates multiple types of conditional inputs, including pure noise, text, and reference images. On the one hand, we dive into the latent space of 3D Generative Adversarial Networks (GANs) and propose a novel disentanglement strategy to separate appearance features from shape features during the generation process. On the other hand, we propose a unified framework for flexible image generation and editing tasks with multi-modal conditions. Our method can generate diverse images from distinct noise inputs, edit attributes through a text description, and conduct style transfer given a reference RGB image. Extensive experiments demonstrate that the proposed method outperforms alternative approaches both qualitatively and quantitatively on image generation and editing.
https://arxiv.org/abs/2403.06470