Despite the success of diffusion-based generative models in generating high-quality images from arbitrary text prompts, prior works generate the entire image directly and cannot provide object-wise manipulation capability. To support wider real-world applications such as professional graphic design and digital artistry, images are frequently created and manipulated in multiple layers, which offers greater flexibility and control. Therefore, in this paper, we propose a layer-collaborative diffusion model, named LayerDiff, specifically designed for text-guided, multi-layered, composable image synthesis. A composable image consists of a background layer, a set of foreground layers, and an associated mask layer for each foreground element. To enable this, LayerDiff introduces a layer-based generation paradigm incorporating multiple layer-collaborative attention modules to capture inter-layer patterns. Specifically, an inter-layer attention module is designed to encourage information exchange and learning between layers, while a text-guided intra-layer attention module incorporates layer-specific prompts to direct content generation for each layer. A layer-specific prompt-enhanced module better captures detailed textual cues from the global prompt. Additionally, a self-mask guidance sampling strategy further unleashes the model's ability to generate multi-layered images. We also present a pipeline that integrates existing perceptual and generative models to produce a large dataset of high-quality, text-prompted, multi-layered images. Extensive experiments demonstrate that LayerDiff generates high-quality multi-layered images with performance comparable to conventional whole-image generation methods. Moreover, LayerDiff enables a broader range of controllable generative applications, including layer-specific image editing and style transfer.
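The inter-layer information exchange described above can be pictured as attention computed over the fused token sequence of all layers. The sketch below is one plausible reading of such a module, not the paper's implementation; the single-head form, projection shapes, and the split-back step are all illustrative assumptions.

```python
import numpy as np

def inter_layer_attention(layer_tokens, wq, wk, wv):
    """Minimal single-head sketch of an inter-layer attention step.

    Tokens from every layer (background, foregrounds, masks) are
    concatenated so each layer's tokens can attend to all the others.
    `layer_tokens` is a list of (tokens_i, dim) arrays; wq/wk/wv are
    (dim, dim) projection matrices (shapes are assumptions).
    """
    x = np.concatenate(layer_tokens, axis=0)           # (T_total, dim)
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)            # row-wise softmax
    out = attn @ v
    # Split the fused sequence back into per-layer token groups.
    sizes = np.cumsum([t.shape[0] for t in layer_tokens])[:-1]
    return np.split(out, sizes, axis=0)
```

Each layer's output tokens are thus a mixture over all layers' values, which is what lets foreground, background, and mask layers stay mutually consistent during generation.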
https://arxiv.org/abs/2403.11929
Previous work has shown that well-crafted adversarial perturbations can threaten the security of video recognition systems. Attackers can invade such models with a low query budget when the perturbations are semantic-invariant, as in StyleFool. Despite its query efficiency, the naturalness of fine-detail regions still requires improvement, since StyleFool applies style transfer to all pixels in each frame. To close this gap, we propose LocalStyleFool, an improved black-box video adversarial attack that superimposes regional style-transfer-based perturbations on videos. Benefiting from the popularity and scalable usability of the Segment Anything Model (SAM), we first extract different regions according to semantic information and then track them through the video stream to maintain temporal consistency. We then add style-transfer-based perturbations to several regions selected by a criterion that combines transfer-based gradient information and regional area. Fine-grained perturbation adjustment follows to make the stylized videos adversarial. We demonstrate through a human-assessed survey that LocalStyleFool improves both intra-frame and inter-frame naturalness while maintaining a competitive fooling rate and query efficiency. Successful experiments on a high-resolution dataset also show that the scrupulous segmentation of SAM helps improve the scalability of adversarial attacks on high-resolution data.
https://arxiv.org/abs/2403.11656
Automation in medical imaging is quite challenging due to the unavailability of annotated datasets and the scarcity of domain experts. In recent years, deep learning techniques have solved some complex medical imaging tasks such as disease classification, salient object localization, and segmentation. However, most of these tasks require large amounts of annotated data for successful implementation. To mitigate the shortage of data, different generative models have been proposed for data augmentation, which can boost classification performance. To this end, synthetic medical image generation models are developed to enlarge datasets; unpaired image-to-image translation models shift samples from a source domain to a target domain. In breast malignancy identification, FNAC is a low-cost, low-invasive modality commonly used by medical practitioners, but the availability of public datasets in this domain is very poor, whereas automating cytology image analysis requires a large amount of annotated data. Therefore, synthetic cytology images are generated by translating breast histopathology samples, which are publicly available. In this study, we explore traditional image-to-image translation models such as CycleGAN and Neural Style Transfer. Measured by FID and KID scores, the generated cytology images are observed to be quite similar to real breast cytology samples.
https://arxiv.org/abs/2403.10885
Visual odometry plays a crucial role in endoscopic imaging, yet the scarcity of realistic images with ground-truth poses presents a significant challenge. Domain adaptation therefore offers a promising approach to bridge the pre-operative planning domain with the intra-operative real domain for learning odometry information. However, existing methodologies suffer from inefficient training. In this work, an efficient neural style transfer framework for endoscopic visual odometry is proposed, which compresses the time from pre-operative planning to the testing phase to less than five minutes. For efficient training, this work focuses on training modules with only a limited number of real images and exploits pre-operative prior information to dramatically reduce training duration. Moreover, during the testing phase, we propose a novel Test Time Adaptation (TTA) method to mitigate the gap in lighting conditions between training and testing datasets. Experimental evaluations on two public endoscope datasets show that our method achieves state-of-the-art accuracy in visual odometry tasks while boasting the fastest training speed. These results demonstrate significant promise for intra-operative surgical applications.
https://arxiv.org/abs/2403.10860
The standard approach to tackling computer vision problems is to train deep convolutional neural network (CNN) models using large-scale image datasets that are representative of the target task. However, in many scenarios it is often challenging to obtain sufficient image data for the target task. Data augmentation is a way to mitigate this challenge. A common practice is to explicitly transform existing images in desired ways so as to create the volume and variability of training data necessary to achieve good generalization performance. In situations where data for the target domain is not accessible, a viable workaround is to synthesize training data from scratch, i.e., synthetic data augmentation. This paper presents an extensive review of synthetic data augmentation techniques. It covers data synthesis approaches based on realistic 3D graphics modeling, neural style transfer (NST), differential neural rendering, and generative artificial intelligence (AI) techniques such as generative adversarial networks (GANs) and variational autoencoders (VAEs). For each of these classes of methods, we focus on the important data generation and augmentation techniques, the general scope of application and specific use-cases, as well as existing limitations and possible workarounds. Additionally, we provide a summary of common synthetic datasets for training computer vision models, highlighting the main features, application domains, and supported tasks. Finally, we discuss the effectiveness of synthetic data augmentation methods. Since this is the first paper to explore synthetic data augmentation methods in great detail, we hope to equip readers with the necessary background information and in-depth knowledge of existing methods and their attendant issues.
https://arxiv.org/abs/2403.10075
Stain normalization algorithms aim to transform the color and intensity characteristics of a source multi-gigapixel histology image to match those of a target image, mitigating inconsistencies in the appearance of the stains used to highlight cellular components. We propose a new approach, StainFuser, which treats this problem as a style transfer task using a novel Conditional Latent Diffusion architecture, eliminating the need for handcrafted color components. With this method, we curate SPI-2M, the largest stain normalization dataset to date, comprising over 2 million histology images transformed with neural style transfer for high-quality supervision. Trained on this data, StainFuser outperforms current state-of-the-art GAN-based and handcrafted methods in terms of the quality of normalized images. Additionally, compared to existing approaches, it improves the performance of nuclei instance segmentation and classification models when used as a test-time augmentation method on the challenging CoNIC dataset. Finally, we apply StainFuser to multi-gigapixel Whole Slide Images (WSIs) and demonstrate improved computational efficiency, image quality, and consistency across tiles over current methods.
https://arxiv.org/abs/2403.09302
Since the breakthrough of ChatGPT, large language models (LLMs) have garnered significant attention in the research community. With the development of LLMs, the question of text style transfer for conversational models has emerged as a natural extension, where chatbots may possess their own styles or even characters. However, standard evaluation metrics have not yet been established for this new setting. This paper addresses this issue by proposing the LMStyle Benchmark, a novel evaluation framework applicable to chat-style text style transfer (C-TST) that can measure the quality of style transfer for LLMs in an automated and scalable manner. In addition to conventional style strength metrics, the LMStyle Benchmark further considers a novel metric called appropriateness, a high-level metric that accounts for coherence, fluency, and other implicit factors without the aid of reference samples. Our experiments demonstrate that the new evaluation methods introduced by the LMStyle Benchmark correlate more highly with human judgments of appropriateness. Based on the LMStyle Benchmark, we present a comprehensive list of evaluation results for popular LLMs, including LLaMA, Alpaca, and Vicuna, reflecting their stylistic properties, such as formality and sentiment strength, along with their appropriateness.
https://arxiv.org/abs/2403.08943
Scene stylization extends neural style transfer to three spatial dimensions. A vital challenge in this problem is maintaining the uniformity of the stylized appearance across multiple views. The vast majority of previous works achieve this by optimizing the scene with a specific style image. In contrast, we propose a novel architecture trained on a collection of style images that, at test time, produces high-quality stylized novel views. Our work builds upon the framework of 3D Gaussian splatting. For a given scene, we take the pretrained Gaussians and process them using a multi-resolution hash grid and a tiny MLP to obtain the conditionally stylized views. The explicit nature of 3D Gaussians gives us inherent advantages over NeRF-based methods, including geometric consistency and a fast training and rendering regime. This makes our method useful for a vast range of practical use cases, such as augmented and virtual reality applications. Through our experiments, we show that our method achieves state-of-the-art performance with superior visual quality on various indoor and outdoor real-world data.
https://arxiv.org/abs/2403.08498
4D style transfer aims at transferring arbitrary visual styles to the synthesized novel views of a dynamic 4D scene with varying viewpoints and times. Existing efforts on 3D style transfer can effectively combine the visual features of style images and neural radiance fields (NeRF) but fail to handle 4D dynamic scenes, limited by the static-scene assumption. Consequently, we aim to handle the novel and challenging problem of 4D style transfer for the first time, which further requires consistency of the stylized results on dynamic objects. In this paper, we introduce StyleDyRF, a method that represents the 4D feature space by deforming a canonical feature volume and learns a linear style transformation matrix on the feature volume in a data-driven fashion. To obtain the canonical feature volume, the rays at each time step are deformed with the geometric prior of a pre-trained dynamic NeRF to render the feature map under the supervision of pre-trained visual encoders. With the content and style cues in the canonical feature volume and the style image, we can learn the style transformation matrix from their covariance matrices with lightweight neural networks. The learned style transformation matrix reflects a direct matching of feature covariance from the content volume to the given style pattern, in analogy with the optimization of the Gram matrix in traditional 2D neural style transfer. Experimental results show that our method not only renders 4D photorealistic style transfer results in a zero-shot manner but also outperforms existing methods in visual quality and consistency.
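The covariance-matching idea behind the linear style transformation has a well-known closed-form analogue: the whitening-coloring transform. The sketch below shows that classical version on generic feature matrices, purely as background intuition; the paper instead learns the matrix with lightweight networks, and all shapes here are illustrative assumptions.

```python
import numpy as np

def linear_style_transform(content_feats, style_feats, eps=1e-5):
    """Closed-form covariance-matching style transform (sketch).

    Builds a whitening-coloring matrix from the content and style feature
    covariances so the transformed content features match the style
    covariance. Features are (channels, samples) arrays.
    """
    def sqrt_and_inv_sqrt(cov):
        vals, vecs = np.linalg.eigh(cov)
        vals = np.clip(vals, eps, None)          # guard tiny eigenvalues
        return (vecs * np.sqrt(vals)) @ vecs.T, (vecs / np.sqrt(vals)) @ vecs.T

    c = content_feats - content_feats.mean(axis=1, keepdims=True)
    s = style_feats - style_feats.mean(axis=1, keepdims=True)
    cov_c = c @ c.T / c.shape[1]
    cov_s = s @ s.T / s.shape[1]
    _, whiten = sqrt_and_inv_sqrt(cov_c)         # cov_c^(-1/2)
    color, _ = sqrt_and_inv_sqrt(cov_s)          # cov_s^(+1/2)
    # Whiten the content features, then color them with the style statistics.
    return color @ whiten @ c + style_feats.mean(axis=1, keepdims=True)
```

After the transform, the output feature covariance equals the style covariance, which is the same constraint that Gram-matrix optimization enforces iteratively in 2D neural style transfer.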
https://arxiv.org/abs/2403.08310
Existing conditional image generative models based on generative adversarial networks (GANs) typically produce a fixed output for the same conditional input, which is unreasonable for highly subjective tasks, such as large-mask image inpainting or style transfer. On the other hand, GAN-based diverse image generation methods require retraining/fine-tuning the network or designing complex noise injection functions, which is computationally expensive, task-specific, or struggles to generate high-quality results. Given that many deterministic conditional image generative models can already produce high-quality yet fixed results, we raise an intriguing question: is it possible for pre-trained deterministic conditional image generative models to generate diverse results without changing network structures or parameters? To answer this question, we re-examine conditional image generation tasks from the perspective of adversarial attack and propose a simple and efficient plug-in method, akin to projected gradient descent (PGD), for diverse and controllable image generation. The key idea is to attack the pre-trained deterministic generative model by adding a micro perturbation to the input condition. In this way, diverse results can be generated without any adjustment of network structures or fine-tuning of the pre-trained models. In addition, we can control which diverse results are generated by specifying the attack direction according to a reference text or image. Our work opens the door to applying adversarial attack to low-level vision tasks, and experiments on various conditional image generation tasks demonstrate the effectiveness and superiority of the proposed method.
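The micro-perturbation idea can be sketched as a PGD-style loop over the input condition of a frozen generator. Everything below is illustrative rather than the paper's method: the objective (push the output away from the unperturbed one), the finite-difference gradient estimate (a real implementation would backpropagate through the network), and all hyperparameters are assumptions.

```python
import numpy as np

def diversify_by_condition_attack(generator, cond, steps=10, alpha=0.01,
                                  eps=0.05, seed=0):
    """PGD-like micro-perturbation of the input condition (sketch).

    `generator` is any fixed deterministic mapping from condition to image.
    The perturbation is kept within an L-inf ball of radius `eps`, as in
    standard PGD, via sign-gradient ascent steps and clipping.
    """
    rng = np.random.default_rng(seed)
    base = generator(cond)                       # unperturbed output
    delta = rng.uniform(-eps, eps, size=cond.shape)
    for _ in range(steps):
        # Finite-difference estimate of d/d(delta) ||G(c+delta) - G(c)||^2.
        grad = np.zeros_like(delta)
        cur = np.sum((generator(cond + delta) - base) ** 2)
        for i in np.ndindex(delta.shape):
            bumped = delta.copy()
            bumped[i] += 1e-4
            grad[i] = (np.sum((generator(cond + bumped) - base) ** 2) - cur) / 1e-4
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)
    return generator(cond + delta)
```

Re-running with different seeds yields different perturbations and hence diverse outputs, with the generator's weights untouched throughout.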
https://arxiv.org/abs/2403.08294
Authorship style transfer aims to rewrite a given text into a specified target style while preserving the meaning of the source. Existing approaches rely on the availability of a large number of target-style exemplars for model training, overlooking cases where only a limited number of target-style examples are available. The development of parameter-efficient transfer learning techniques and policy optimization (PO) approaches suggests that lightweight PO is a feasible approach to low-resource style transfer. In this work, we propose a simple two-step tune-and-optimize technique for low-resource textual style transfer. We apply our technique to authorship transfer as well as a larger-data native-language style task, and in both cases find that it outperforms state-of-the-art baseline models.
https://arxiv.org/abs/2403.08043
We introduce StyleGaussian, a novel 3D style transfer technique that allows instant transfer of any image's style to a 3D scene at 10 frames per second (fps). Leveraging 3D Gaussian Splatting (3DGS), StyleGaussian achieves style transfer without compromising real-time rendering ability or multi-view consistency. It achieves instant style transfer in three steps: embedding, transfer, and decoding. Initially, 2D VGG scene features are embedded into reconstructed 3D Gaussians. Next, the embedded features are transformed according to a reference style image. Finally, the transformed features are decoded into the stylized RGB image. StyleGaussian has two novel designs. The first is an efficient feature rendering strategy that first renders low-dimensional features and then maps them into high-dimensional features while embedding VGG features. It cuts memory consumption significantly and enables 3DGS to render high-dimensional, memory-intensive features. The second is a K-nearest-neighbor-based 3D CNN. Working as the decoder for the stylized features, it eliminates the 2D CNN operations that compromise strict multi-view consistency. Extensive experiments show that StyleGaussian achieves instant 3D stylization with superior stylization quality while preserving real-time rendering and strict multi-view consistency. Project page: this https URL
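The K-nearest-neighbor-based 3D CNN mentioned above can be pictured as gathering each point's spatial neighbors and mixing their features with a shared weight tensor, all in 3D, so no 2D image-space convolution is involved. The sketch below is a generic single layer of that kind, not StyleGaussian's decoder: the brute-force distance search, self-inclusion in the neighbor set, and the weight layout are illustrative assumptions.

```python
import numpy as np

def knn_feature_conv(points, feats, weights, k=4):
    """One KNN-based 'convolution' layer over 3D points (sketch).

    points : (N, 3) positions; feats : (N, C_in) per-point features;
    weights: (k, C_in, C_out) shared mixing tensor, one slice per
    neighbor rank. Each point aggregates its k nearest neighbors
    (itself included, since self-distance is zero).
    """
    # Brute-force pairwise squared distances; a real implementation
    # would use a spatial index (e.g. a KD-tree).
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    idx = np.argsort(d2, axis=1)[:, :k]          # (N, k) neighbor indices
    gathered = feats[idx]                        # (N, k, C_in)
    # Mix across neighbor ranks and input channels in one einsum.
    return np.einsum('nkc,kco->no', gathered, weights)
```

Because the operation is defined directly on the 3D point set, its output is independent of camera viewpoint, which is exactly why such a decoder preserves strict multi-view consistency.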
https://arxiv.org/abs/2403.07807
Recently, under the umbrella of Responsible AI, efforts have been made to develop gender-ambiguous synthetic speech that represents all individuals in the gender spectrum with a single voice. However, research efforts have completely overlooked speaking style, despite differences found between binary and non-binary populations. In this work, we synthesise gender-ambiguous speech by combining the timbre of a male speaker with the manner of speech of a female speaker, using voice morphing and pitch shifting towards the male-female boundary. Subjective evaluations indicate that the morphed samples that convey the female speaking style are more ambiguous than those that undergo pure pitch transformations, suggesting that speaking style can be a contributing factor in creating gender-ambiguous speech. To our knowledge, this is the first study to explicitly use the transfer of speaking style to create gender-ambiguous voices.
https://arxiv.org/abs/2403.07661
3D-consistent image generation from a single 2D semantic label is an important and challenging research topic in computer graphics and computer vision. Although related works have made great progress in this field, most existing methods suffer from poor disentanglement of shape and appearance and lack multi-modal control. In this paper, we propose a novel end-to-end 3D-aware image generation and editing model incorporating multiple types of conditional inputs, including pure noise, text, and reference images. On the one hand, we dive into the latent space of 3D Generative Adversarial Networks (GANs) and propose a novel disentanglement strategy to separate appearance features from shape features during the generation process. On the other hand, we propose a unified framework for flexible image generation and editing tasks with multi-modal conditions. Our method can generate diverse images with distinct noises, edit attributes through a text description, and conduct style transfer given a reference RGB image. Extensive experiments demonstrate that the proposed method outperforms alternative approaches both qualitatively and quantitatively on image generation and editing.
https://arxiv.org/abs/2403.06470
While existing motion style transfer methods are effective between two motions with identical content, their performance significantly diminishes when transferring style between motions with different content. This challenge lies in the lack of a clear separation between the content and style of a motion. To tackle this challenge, we propose a novel motion style transformer that effectively disentangles style from content and generates a plausible motion with the style transferred from a source motion. Our distinctive approach to achieving disentanglement is twofold: (1) a new motion style transformer architecture with a part-attentive style modulator across body parts and Siamese encoders that encode style and content features separately; (2) a style disentanglement loss. Our method outperforms existing methods and demonstrates exceptionally high quality, particularly for motion pairs with different content, without the need for heuristic post-processing. Codes are available at this https URL.
https://arxiv.org/abs/2403.06225
Current 3D stylization methods often assume static scenes, which violates the dynamic nature of our real world. To address this limitation, we present S-DyRF, a reference-based spatio-temporal stylization method for dynamic neural radiance fields. However, stylizing dynamic 3D scenes is inherently challenging due to the limited availability of stylized reference images along the temporal axis. Our key insight lies in introducing additional temporal cues besides the provided reference. To this end, we generate temporal pseudo-references from the given stylized reference. These pseudo-references facilitate the propagation of style information from the reference to the entire dynamic 3D scene. For coarse style transfer, we enforce novel views and times to mimic the style details present in pseudo-references at the feature level. To preserve high-frequency details, we create a collection of stylized temporal pseudo-rays from temporal pseudo-references. These pseudo-rays serve as detailed and explicit stylization guidance for achieving fine style transfer. Experiments on both synthetic and real-world datasets demonstrate that our method yields plausible stylized results of space-time view synthesis on dynamic 3D scenes.
https://arxiv.org/abs/2403.06205
Understanding how visual information is encoded in biological and artificial systems often requires vision scientists to generate appropriate stimuli to test specific hypotheses. Although deep neural network models have revolutionized the field of image generation with methods such as image style transfer, available methods for video generation are scarce. Here, we introduce the Spatiotemporal Style Transfer (STST) algorithm, a dynamic visual stimulus generation framework that allows powerful manipulation and synthesis of video stimuli for vision research. It is based on a two-stream deep neural network model that factorizes spatial and temporal features to generate dynamic visual stimuli whose model layer activations are matched to those of input videos. As an example, we show that our algorithm enables the generation of model metamers, dynamic stimuli whose layer activations within our two-stream model are matched to those of natural videos. We show that these generated stimuli match the low-level spatiotemporal features of their natural counterparts but lack their high-level semantic features, making it a powerful paradigm to study object recognition. Late layer activations in deep vision models exhibited a lower similarity between natural and metameric stimuli compared to early layers, confirming the lack of high-level information in the generated stimuli. Finally, we use our generated stimuli to probe the representational capabilities of predictive coding deep networks. These results showcase potential applications of our algorithm as a versatile tool for dynamic stimulus generation in vision science.
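Metamer generation by activation matching reduces, in its simplest form, to gradient descent on the stimulus so that its model activations reach a reference's activations. The toy sketch below illustrates only that core loop, with a caller-supplied feature function and gradient standing in for the two-stream network; the quadratic objective and optimizer settings are assumptions, not details from the paper.

```python
import numpy as np

def match_activations(feature_fn, feature_grad, target_act, x0,
                      steps=200, lr=0.1):
    """Metamer generation by activation matching (toy sketch).

    Iteratively adjusts a stimulus `x` so that feature_fn(x) approaches
    `target_act`, minimizing 0.5 * ||feature_fn(x) - target_act||^2.
    `feature_grad(x, diff)` must return the gradient of that objective,
    i.e. J(x)^T @ diff where J is the Jacobian of feature_fn.
    """
    x = x0.copy()
    for _ in range(steps):
        diff = feature_fn(x) - target_act    # activation mismatch
        x -= lr * feature_grad(x, diff)      # gradient descent step
    return x
```

With a deep network in place of `feature_fn`, matching only selected layers is what produces stimuli that share low-level statistics with a reference while diverging in unmatched, higher-level features.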
https://arxiv.org/abs/2403.04940
We study text-based image editing (TBIE) of a single image by counterfactual inference because it is an elegant formulation to precisely address the requirement: the edited image should retain the fidelity of the original one. Through the lens of the formulation, we find that the crux of TBIE is that existing techniques hardly achieve a good trade-off between editability and fidelity, mainly due to the overfitting of the single-image fine-tuning. To this end, we propose a Doubly Abductive Counterfactual inference framework (DAC). We first parameterize an exogenous variable as a UNet LoRA, whose abduction can encode all the image details. Second, we abduct another exogenous variable parameterized by a text encoder LoRA, which recovers the lost editability caused by the overfitted first abduction. Thanks to the second abduction, which exclusively encodes the visual transition from post-edit to pre-edit, its inversion -- subtracting the LoRA -- effectively reverts pre-edit back to post-edit, thereby accomplishing the edit. Through extensive experiments, our DAC achieves a good trade-off between editability and fidelity. Thus, we can support a wide spectrum of user editing intents, including addition, removal, manipulation, replacement, style transfer, and facial change, which are extensively validated in both qualitative and quantitative evaluations. Codes are in this https URL.
https://arxiv.org/abs/2403.02981
Text Style Transfer (TST) seeks to alter the style of text while retaining its core content. Given the constraints of limited parallel datasets for TST, we propose CoTeX, a framework that leverages large language models (LLMs) alongside chain-of-thought (CoT) prompting to facilitate TST. CoTeX distills the complex rewriting and reasoning capabilities of LLMs into more streamlined models capable of working with both non-parallel and parallel data. Through experimentation across four TST datasets, CoTeX is shown to surpass traditional supervised fine-tuning and knowledge distillation methods, particularly in low-resource settings. We conduct a comprehensive evaluation, comparing CoTeX against current unsupervised, supervised, in-context learning (ICL) techniques, and instruction-tuned LLMs. Furthermore, CoTeX distinguishes itself by offering transparent explanations for its style transfer process.
https://arxiv.org/abs/2403.01106
This paper addresses a common challenge in deep learning-based image transformation methods, such as image enhancement and super-resolution, which rely heavily on precisely aligned paired datasets with pixel-level alignment. However, creating precisely aligned paired images presents significant challenges and hinders the advancement of methods trained on such data. To overcome this challenge, this paper introduces a novel and simple Frequency Distribution Loss (FDL) for computing distribution distance within the frequency domain. Specifically, we transform image features into the frequency domain using the Discrete Fourier Transform (DFT). Subsequently, frequency components (amplitude and phase) are processed separately to form the FDL loss function. Our method is empirically proven effective as a training constraint due to its thoughtful utilization of global information in the frequency domain. Extensive experimental evaluations, focusing on image enhancement and super-resolution tasks, demonstrate that FDL outperforms existing misalignment-robust loss functions. Furthermore, we explore the potential of FDL for image style transfer that relies solely on completely misaligned data. Our code is available at: this https URL
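The amplitude/phase split described above can be sketched directly with a 2D DFT. The sketch below is an illustrative reading of such a frequency-domain loss, not the paper's exact formulation: the distance choices, the unit-circle phase comparison, and the weighting scheme are all assumptions.

```python
import numpy as np

def frequency_distribution_loss(pred, target, w_amp=1.0, w_pha=1.0):
    """Frequency-domain amplitude/phase loss (illustrative sketch).

    Transforms two feature maps with a 2D DFT and penalizes amplitude
    and phase differences separately, then combines them with weights.
    """
    fp = np.fft.fft2(pred)
    ft = np.fft.fft2(target)
    amp_loss = np.mean(np.abs(np.abs(fp) - np.abs(ft)))
    # Compare phases as unit-magnitude complex numbers to avoid
    # wrap-around artifacts at +/- pi.
    pha_loss = np.mean(np.abs(np.exp(1j * np.angle(fp))
                              - np.exp(1j * np.angle(ft))))
    return w_amp * amp_loss + w_pha * pha_loss
```

Because every DFT coefficient depends on all pixels, the loss compares global statistics of the two images rather than per-pixel values, which is what makes a frequency-domain constraint tolerant of spatial misalignment.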
https://arxiv.org/abs/2402.18192