Diffusion models have recently shown the ability to generate high-quality images. However, controlling their generation process still poses challenges. One such challenge is image style transfer, which transfers the visual attributes of a style image to a content image. A typical obstacle in this task is the need for additional training of a pre-trained model. We propose a training-free style transfer algorithm, Style Tracking Reverse Diffusion Process (STRDP), for a pre-trained Latent Diffusion Model (LDM). Our algorithm applies the Adaptive Instance Normalization (AdaIN) function in a distinct manner during the reverse diffusion process of an LDM while tracking the encoding history of the style image. This enables style transfer in the latent space of the LDM at reduced computational cost and provides compatibility with various LDM models. Through a series of experiments and a user study, we show that our method can quickly transfer the style of an image without additional training. The speed, compatibility, and training-free nature of our algorithm facilitate agile experimentation with combinations of styles and LDMs for extensive applications.
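For readers unfamiliar with AdaIN, the operation itself is a simple per-channel statistic swap. The PyTorch sketch below is illustrative only: it is not the authors' released code and omits STRDP's tracking of the style encoding history; `content` and `style` stand in for a denoised latent and the style latent at the matching step.

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """AdaIN: re-normalize content features to match per-channel style statistics.

    content, style: tensors of shape (B, C, H, W), e.g. LDM latents.
    """
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean
```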
https://arxiv.org/abs/2410.01366
Text style transfer (TST) aims to modify the style of a text without altering its original meaning. Large language models (LLMs) demonstrate superior performance across multiple tasks, including TST. However, in zero-shot setups, they tend to directly copy a significant portion of the input text to the output without effectively changing its style. To enhance the stylistic variety and fluency of the text, we present sNeuron-TST, a novel approach for steering LLMs using style-specific neurons in TST. Specifically, we identify neurons associated with the source and target styles and deactivate source-style-only neurons to give target-style words a higher probability, aiming to enhance the stylistic diversity of the generated text. However, we find that this deactivation negatively impacts the fluency of the generated text, which we address by proposing an improved contrastive decoding method that accounts for rapid token probability shifts across layers caused by deactivated source-style neurons. Empirical experiments demonstrate the effectiveness of the proposed method on six benchmarks, encompassing formality, toxicity, politics, politeness, authorship, and sentiment.
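As a rough illustration of the decoding side, the sketch below shows a generic layer-contrastive next-token rule of the kind the paper builds upon. It is a minimal sketch under our own assumptions: the function name, the plausibility threshold `alpha`, and the choice of contrast layer are hypothetical, and the paper's handling of rapid cross-layer probability shifts caused by deactivated neurons is not reproduced.

```python
import torch
import torch.nn.functional as F

def contrastive_next_token(final_logits: torch.Tensor,
                           early_logits: torch.Tensor,
                           alpha: float = 0.1) -> int:
    """Pick the next token by contrasting the final layer against an earlier layer.

    final_logits, early_logits: (vocab_size,) next-token logits from the two layers.
    alpha: plausibility threshold relative to the most likely final-layer token.
    """
    p_final = F.softmax(final_logits, dim=-1)
    p_early = F.softmax(early_logits, dim=-1)
    # Only tokens that are plausible under the final layer may be selected.
    plausible = p_final >= alpha * p_final.max()
    scores = torch.log(p_final + 1e-12) - torch.log(p_early + 1e-12)
    scores = torch.where(plausible, scores, torch.full_like(scores, float("-inf")))
    return int(scores.argmax())
```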
https://arxiv.org/abs/2410.00593
While style transfer techniques have been well-developed for 2D image stylization, the extension of these methods to 3D scenes remains relatively unexplored. Existing approaches demonstrate proficiency in transferring colors and textures but often struggle with replicating the geometry of the scenes. In our work, we leverage an explicit Gaussian Splatting (GS) representation and directly match the distributions of Gaussians between style and content scenes using the Earth Mover's Distance (EMD). By employing the entropy-regularized Wasserstein-2 distance, we ensure that the transformation maintains spatial smoothness. Additionally, we decompose the scene stylization problem into smaller chunks to enhance efficiency. This paradigm shift reframes stylization from a pure generative process driven by latent space losses to an explicit matching of distributions between two Gaussian representations. Our method achieves high-resolution 3D stylization by faithfully transferring details from 3D style scenes onto the content scene. Furthermore, WaSt-3D consistently delivers results across diverse content and style scenes without necessitating any training, as it relies solely on optimization-based techniques. See our project page for additional results and source code: this https URL.
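To make the distribution-matching idea concrete, here is a minimal NumPy sketch of entropy-regularized optimal transport (Sinkhorn iterations) between two small point clouds. It is our own simplified illustration, not the released WaSt-3D code: it matches only Gaussian centers rather than full Gaussian parameters, and ignores the chunk-wise decomposition.

```python
import numpy as np

def sinkhorn_plan(x: np.ndarray, y: np.ndarray, reg: float = 0.05, n_iters: int = 200) -> np.ndarray:
    """Entropy-regularized optimal transport plan between two small point clouds.

    x: (n, d) content Gaussian centers; y: (m, d) style Gaussian centers.
    Assumes roughly unit-scaled coordinates so the kernel does not underflow.
    Returns the (n, m) coupling matrix under uniform marginals.
    """
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
    K = np.exp(-cost / reg)
    a = np.full(len(x), 1.0 / len(x))                       # uniform source weights
    b = np.full(len(y), 1.0 / len(y))                       # uniform target weights
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):                                # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

A barycentric projection of the resulting plan, `x_new = (P @ y) / P.sum(1, keepdims=True)`, then moves each content center toward the style distribution.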
https://arxiv.org/abs/2409.17917
Incorporating cross-speaker style transfer in text-to-speech (TTS) models is challenging due to the need to disentangle speaker and style information in audio. In low-resource expressive data scenarios, voice conversion (VC) can generate expressive speech for target speakers, which can then be used to train the TTS model. However, the quality and style transfer ability of the VC model are crucial for the overall TTS model quality. In this work, we explore the use of synthetic data generated by a VC model to assist the TTS model in cross-speaker style transfer tasks. Additionally, we employ pre-training of the style encoder using timbre perturbation and prototypical angular loss to mitigate speaker leakage. Our results show that using VC synthetic data can improve the naturalness and speaker similarity of TTS in cross-speaker scenarios. Furthermore, we extend this approach to a cross-language scenario, enhancing accent transfer.
https://arxiv.org/abs/2409.17364
Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles (including singing method, emotion, rhythm, technique, and pronunciation) from audio and text prompts. However, the multifaceted nature of singing styles poses a significant challenge for effective modeling, transfer, and control. Furthermore, current SVS models often fail to generate singing voices rich in stylistic nuances for unseen singers. To address these challenges, we introduce StyleSinger 2, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control. Specifically, StyleSinger 2 proposes three primary modules: 1) the clustering style encoder employs a clustering vector quantization model to stably condense style information into a compact latent space; 2) the Style and Duration Language Model (S&D-LM) concurrently predicts style information and phoneme duration, which benefits both; 3) the style adaptive decoder uses a novel mel-style adaptive normalization method to generate singing voices with enhanced details. Experimental results show that StyleSinger 2 outperforms all baseline models in synthesis quality, singer similarity, and style controllability across various tasks, including zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer. Singing voice samples can be accessed at this https URL.
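The core operation of a clustering-style encoder, snapping continuous style features to a learned codebook, can be sketched as follows. This is an illustrative PyTorch fragment under our own assumptions (tensor shapes, the straight-through trick, and all names are hypothetical), not the paper's implementation.

```python
import torch

def vector_quantize(style_feats: torch.Tensor, codebook: torch.Tensor):
    """Condense frame-level style features by snapping them to codebook entries.

    style_feats: (B, T, D) continuous style features.
    codebook:    (K, D) learned cluster centroids.
    Returns the quantized features and the chosen indices.
    """
    B = style_feats.size(0)
    dists = torch.cdist(style_feats, codebook.unsqueeze(0).expand(B, -1, -1))  # (B, T, K)
    idx = dists.argmin(dim=-1)                       # (B, T) nearest centroid per frame
    quantized = codebook[idx]                        # (B, T, D)
    # Straight-through estimator so gradients reach the encoder during training.
    quantized = style_feats + (quantized - style_feats).detach()
    return quantized, idx
```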
https://arxiv.org/abs/2409.15977
Arbitrary artistic style transfer is a field that integrates rational academic research with emotional artistic creation. It aims to produce an image that not only features the artistic characteristics of the target style but also preserves the texture structure of the content image itself. Existing style transfer methods rely primarily on either global statistics-based information or local patch-based matching. As a result, the generated images often either superficially apply a filter to the content image or capture extraneous semantic information from the style image, leading to a significant deviation from the global style. In this paper, we propose Affinity Enhanced-Attentional Networks (AEANet), which include a content affinity-enhanced attention (CAEA) module, a style affinity-enhanced attention (SAEA) module, and a hybrid attention (HA) module. The CAEA and SAEA modules first use attention to improve the content and style representations, with a Detail Enhanced (DE) module to reinforce fine details, and then align the global statistical information of the content and style features to fine-tune the feature information. Subsequently, the HA module adjusts the distribution of style features based on the distribution of content features. Additionally, we introduce an affinity attention-based Local Dissimilarity Loss to preserve the affinities between the content and style images. Experimental results demonstrate that our approach outperforms state-of-the-art methods in arbitrary style transfer.
https://arxiv.org/abs/2409.14652
The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse controllable and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. Particularly, (1) we collect 80.59 hours of high-quality singing voices, forming the largest recorded singing dataset; (2) 20 professional singers across nine widely spoken languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six commonly used singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition; (5) singing voices are accompanied by manual phoneme-to-audio alignments, global style labels, and 16.16 hours of paired speech for various singing tasks. Moreover, to facilitate the use of GTSinger, we conduct four benchmark experiments: technique-controllable singing voice synthesis, technique recognition, style transfer, and speech-to-singing conversion. The corpus and demos can be found at this http URL. We provide the dataset and the code for processing data and conducting benchmarks at this https URL and this https URL.
https://arxiv.org/abs/2409.13832
Text-based diffusion video editing systems have been successful in performing edits with high fidelity and textual alignment. However, this success is limited to rigid-type editing such as style transfer and object overlay, which preserve the original structure of the input video. This limitation stems from the initial latent noise employed by diffusion video editing systems, which is prepared by gradually infusing Gaussian noise into the input video. However, we observed that the visual structure of the input video still persists within this initial latent noise, thereby restricting non-rigid editing such as motion changes that necessitate structural modifications. To this end, this paper proposes the Dilutional Noise Initialization (DNI) framework, which enables editing systems to perform precise and dynamic modifications, including non-rigid editing. DNI introduces the concept of 'noise dilution', which adds further noise to the latent noise in the region to be edited to soften the structural rigidity imposed by the input video, resulting in more effective edits closer to the target prompt. Extensive experiments demonstrate the effectiveness of the DNI framework.
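One plausible reading of "noise dilution" is a masked blend between the structure-preserving initial latent and fresh Gaussian noise. The sketch below illustrates that reading only; the blending weights, mask format, and function name are our assumptions rather than the paper's exact formulation.

```python
import torch

def dilute_noise(latent: torch.Tensor, mask: torch.Tensor, strength: float = 0.6) -> torch.Tensor:
    """Add extra Gaussian noise only inside the region to be edited.

    latent: (B, C, H, W) initial latent noise obtained by noising the input video.
    mask:   (B, 1, H, W) soft mask of the edit region (1 = edit, 0 = keep).
    strength: how much of the original structure to wash out in the masked area.
    """
    fresh = torch.randn_like(latent)
    diluted = (1 - strength) ** 0.5 * latent + strength ** 0.5 * fresh
    return mask * diluted + (1 - mask) * latent
```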
https://arxiv.org/abs/2409.13037
Western music is often characterized by a homophonic texture, in which the musical content can be organized into a melody and an accompaniment. In orchestral music, in particular, the composer can select specific characteristics for each instrument's part within the accompaniment, while also needing to adapt the melody to suit the capabilities of the instruments performing it. In this work, we propose METEOR, a model for Melody-aware Texture-controllable Orchestral music generation. This model performs symbolic multi-track music style transfer with a focus on melodic fidelity. We allow bar- and track-level controllability of the accompaniment with various textural attributes while keeping a homophonic texture. We show that the model can achieve controllability performance similar to strong baselines while greatly improving melodic fidelity.
https://arxiv.org/abs/2409.11753
The goal of style transfer is, given a content image and a style source, to generate a new image that preserves the content but adopts the artistic representation of the style source. Most state-of-the-art architectures use transformers or diffusion-based models to perform this task, despite the heavy computational burden they require. In particular, transformers use self- and cross-attention layers with a large memory footprint, while diffusion models require long inference times. To overcome this, this paper explores a novel design of Mamba, an emergent State-Space Model (SSM), called Mamba-ST, to perform style transfer. To do so, we adapt the Mamba linear equation to simulate the behavior of cross-attention layers, which can combine two separate embeddings into a single output, while drastically reducing memory usage and time complexity. We modify Mamba's inner equations to accept inputs from, and combine, two separate data streams. To the best of our knowledge, this is the first attempt to adapt the equations of SSMs to a vision task like style transfer without requiring any other module such as cross-attention or custom normalization layers. An extensive set of experiments demonstrates the superiority and efficiency of our method in performing style transfer compared to transformers and diffusion models. Results show improved quality in terms of both ArtFID and FID metrics. Code is available at this https URL.
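To give a feel for how an SSM recurrence can stand in for cross-attention, the toy sketch below feeds one stream as the input sequence and lets the other stream produce the input-dependent projections B_t and C_t. It is a deliberately simplified, randomly parameterized illustration and not Mamba-ST's actual equations, which also involve discretization, gating, and selective-scan optimizations.

```python
import torch

def cross_stream_ssm(content: torch.Tensor, style: torch.Tensor, d_state: int = 16) -> torch.Tensor:
    """Toy state-space recurrence where two streams play different roles.

    content: (L, D) content tokens, used as the SSM input x_t.
    style:   (L, D) style tokens (same length assumed), used to produce the
             input-dependent projections B_t and C_t.
    Returns an (L, D) fused sequence. All parameters are random, for illustration only.
    """
    L, D = content.shape
    A = -torch.rand(D, d_state)                       # negative entries keep the recurrence stable
    W_b = torch.randn(D, d_state) / D ** 0.5
    W_c = torch.randn(D, d_state) / D ** 0.5
    B = style @ W_b                                   # (L, d_state), driven by the style stream
    C = style @ W_c                                   # (L, d_state), driven by the style stream
    h = torch.zeros(D, d_state)
    outputs = []
    for t in range(L):
        h = torch.exp(A) * h + B[t] * content[t, :, None]   # h_t = A_bar * h_{t-1} + B_t x_t
        outputs.append((h * C[t]).sum(dim=-1))               # y_t = C_t . h_t, per channel
    return torch.stack(outputs)
```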
https://arxiv.org/abs/2409.10385
Current mainstream audio generation methods primarily rely on simple text prompts, often failing to capture the nuanced details necessary for multi-style audio generation. To address this limitation, the Sound Event Enhanced Prompt Adapter is proposed. Unlike traditional static global style transfer, this method extracts style embedding through cross-attention between text and reference audio for adaptive style control. Adaptive layer normalization is then utilized to enhance the model's capacity to express multiple styles. Additionally, the Sound Event Reference Style Transfer Dataset (SERST) is introduced for the proposed target style audio generation task, enabling dual-prompt audio generation using both text and audio references. Experimental results demonstrate the robustness of the model, achieving state-of-the-art Fréchet Distance of 26.94 and KL Divergence of 1.82, surpassing Tango, AudioLDM, and AudioGen. Furthermore, the generated audio shows high similarity to its corresponding audio reference. The demo, code, and dataset are publicly available.
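Adaptive layer normalization, as used here to inject an extracted style embedding into the generator, commonly takes the form below. This is a minimal PyTorch module in a standard formulation that we assume; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class AdaptiveLayerNorm(nn.Module):
    """LayerNorm whose scale and shift are predicted from a style embedding."""

    def __init__(self, dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(style_dim, 2 * dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) hidden states; style: (B, style_dim) pooled style embedding
        scale, shift = self.to_scale_shift(style).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```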
https://arxiv.org/abs/2409.09381
Over the last decade, deep learning has shown great success at performing computer vision tasks, including classification, super-resolution, and style transfer. Now, we apply it to data compression to help build the next generation of multimedia codecs. This thesis provides three primary contributions to this new field of learned compression. First, we present an efficient low-complexity entropy model that dynamically adapts the encoding distribution to a specific input by compressing and transmitting the encoding distribution itself as side information. Secondly, we propose a novel lightweight low-complexity point cloud codec that is highly specialized for classification, attaining significant reductions in bitrate compared to non-specialized codecs. Lastly, we explore how motion within the input domain between consecutive video frames is manifested in the corresponding convolutionally-derived latent space.
https://arxiv.org/abs/2409.08376
Motion style transfer changes the style of a motion while retaining its content and is useful in computer animations and games. Contact is an essential component of motion style transfer that should be controlled explicitly in order to express the style vividly while enhancing motion naturalness and quality. However, it is unknown how to decouple and control contact to achieve fine-grained control in motion style transfer. In this paper, we present a novel style transfer method for fine-grained control over contacts while achieving both motion naturalness and spatial-temporal variations of style. Based on our empirical evidence, we propose controlling contact indirectly through the hip velocity, which can be further decomposed into the trajectory and contact timing, respectively. To this end, we propose a new model that explicitly models the correlations between motions and trajectory/contact timing/style, allowing us to decouple and control each separately. Our approach is built around a motion manifold, where hip controls can be easily integrated into a Transformer-based decoder. It is versatile in that it can generate motions directly as well as be used as post-processing for existing methods to improve quality and contact controllability. In addition, we propose a new metric that measures a correlation pattern of motions based on our empirical evidence, aligning well with human perception in terms of motion naturalness. Based on extensive evaluation, our method outperforms existing methods in terms of style expressivity and motion quality.
https://arxiv.org/abs/2409.05387
In this paper, we introduce MRStyle, a comprehensive framework that enables color style transfer using multi-modality reference, including image and text. To achieve a unified style feature space for both modalities, we first develop a neural network called IRStyle, which generates stylized 3D lookup tables for image reference. This is accomplished by integrating an interaction dual-mapping network with a combined supervised learning pipeline, resulting in three key benefits: elimination of visual artifacts, efficient handling of high-resolution images with low memory usage, and maintenance of style consistency even in situations with significant color style variations. For text reference, we align the text feature of stable diffusion priors with the style feature of our IRStyle to perform text-guided color style transfer (TRStyle). Our TRStyle method is highly efficient in both training and inference, producing notable open-set text-guided transfer results. Extensive experiments in both image and text settings demonstrate that our proposed method outperforms the state-of-the-art in both qualitative and quantitative evaluations.
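Applying a 3D color lookup table to an image is typically done with trilinear interpolation, which in PyTorch can be expressed with `grid_sample`. The sketch below is a generic illustration of that step only; the LUT layout and channel-to-axis ordering are assumptions, and the IRStyle network that predicts the LUT is not shown.

```python
import torch
import torch.nn.functional as F

def apply_3d_lut(image: torch.Tensor, lut: torch.Tensor) -> torch.Tensor:
    """Apply a 3D color lookup table with trilinear interpolation via grid_sample.

    image: (B, 3, H, W) RGB values in [0, 1].
    lut:   (3, S, S, S) table; we assume its last three axes are ordered (b, g, r)
           so that grid_sample's (x, y, z) sampling convention maps to (r, g, b).
    """
    B, _, H, W = image.shape
    grid = image.permute(0, 2, 3, 1) * 2 - 1          # (B, H, W, 3), rescaled to [-1, 1]
    grid = grid.view(B, 1, H, W, 3)                   # add a depth dimension of 1
    lut = lut.unsqueeze(0).expand(B, -1, -1, -1, -1)  # (B, 3, S, S, S)
    out = F.grid_sample(lut, grid, mode="bilinear", align_corners=True)
    return out.view(B, 3, H, W)                       # (B, 3, 1, H, W) -> (B, 3, H, W)
```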
https://arxiv.org/abs/2409.05250
A robust face recognition model must be trained using datasets that include a large number of subjects and numerous samples per subject under varying conditions (such as pose, expression, age, noise, and occlusion). Due to ethical and privacy concerns, large-scale real face datasets such as MS1MV3 have been discontinued, and synthetic face generators based on GANs and Diffusion Models, such as SYNFace, SFace, DigiFace-1M, IDiff-Face, DCFace, and GANDiffFace, have been proposed to supply this demand. Some of these methods can produce high-fidelity realistic faces, but with low intra-class variance, while others generate high-variance faces with low identity consistency. In this paper, we propose a Triple Condition Diffusion Model (TCDiff) to improve face style transfer from real to synthetic faces through 2D and 3D facial constraints, enhancing face identity consistency while keeping the necessary high intra-class variance. Face recognition experiments using 1k, 2k, and 5k classes of our new dataset for training outperform state-of-the-art synthetic datasets on real face benchmarks such as LFW, CFP-FP, AgeDB, and BUPT. Our source code is available at: this https URL.
https://arxiv.org/abs/2409.03600
One-shot voice conversion (VC) aims to change the timbre of any source speech to match that of an unseen target speaker using only one speech sample. Existing style transfer-based VC methods rely on speech representation disentanglement and struggle to accurately and independently encode each speech component and to recompose them effectively into converted speech. To tackle this, we propose Pureformer-VC, which uses Conformer blocks to build a disentangled encoder and Zipformer blocks to build a style transfer decoder as the generator. In the decoder, we use effective styleformer blocks to integrate speaker characteristics into the generated speech. The model uses a generative VAE loss for encoding components and a triplet loss for unsupervised discriminative training. We apply the styleformer method to Zipformer's shared weights for style transfer. The experimental results show that the proposed model achieves comparable subjective scores and exhibits improvements in objective metrics compared to existing methods in a one-shot voice conversion scenario.
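The training objective described, a generative VAE loss plus a triplet loss, can be sketched as follows in PyTorch. The equal weighting, L1 reconstruction term, and margin value are our assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn.functional as F

def vc_losses(recon, target, mu, logvar, anchor, positive, negative, margin: float = 0.3):
    """Combined objective: VAE-style reconstruction + KL, plus a speaker triplet loss.

    recon/target: reconstructed and ground-truth mel-spectrograms, (B, T, n_mels).
    mu/logvar:    posterior parameters of the encoded components, (B, z_dim).
    anchor/positive/negative: speaker-style embeddings, (B, e_dim).
    """
    recon_loss = F.l1_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return recon_loss + kl + triplet
```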
https://arxiv.org/abs/2409.01668
This article compares two style transfer methods in image processing: the traditional method, which synthesizes new images by stitching together small patches from existing images, and a modern machine learning-based approach that uses a segmentation network to isolate foreground objects and apply style transfer solely to the background. The traditional method excels in creating artistic abstractions but can struggle with seamlessness, whereas the machine learning method preserves the integrity of foreground elements while enhancing the background, offering improved aesthetic quality and computational efficiency. Our study indicates that machine learning-based methods are more suited for real-world applications where detail preservation in foreground elements is essential.
https://arxiv.org/abs/2409.00606
Portrait sketching involves capturing identity-specific attributes of a real face with abstract lines and shades. Unlike photo-realistic images, a good portrait sketch generation method needs selective attention to detail, making the problem challenging. This paper introduces Portrait Sketching StyleGAN (PS-StyleGAN), a style transfer approach tailored for portrait sketch synthesis. We leverage the semantic $W+$ latent space of StyleGAN to generate portrait sketches, allowing us to make meaningful edits, like pose and expression alterations, without compromising identity. To achieve this, we propose the use of Attentive Affine transform blocks in our architecture, and a training strategy that allows us to change StyleGAN's output without finetuning it. These blocks learn to modify the style latent code by paying attention to both content and style latent features, allowing us to adapt the outputs of StyleGAN in an inversion-consistent manner. Our approach uses only a few paired examples ($\sim 100$) to model a style and has a short training time. We demonstrate PS-StyleGAN's superiority over the current state-of-the-art methods on various datasets, qualitatively and quantitatively.
https://arxiv.org/abs/2409.00345
The diffusion model has shown exceptional capabilities in controlled image generation, which has further fueled interest in image style transfer. Existing works mainly focus on training-free methods (e.g., image inversion) due to the scarcity of specific data. In this study, we present a data construction pipeline for content-style-stylized image triplets that generates and automatically cleanses stylized data triplets. Based on this pipeline, we construct IMAGStyle, the first large-scale style transfer dataset, containing 210k image triplets, available for the community to explore and research. Equipped with IMAGStyle, we propose CSGO, a style transfer model based on end-to-end training, which explicitly decouples content and style features employing independent feature injection. The unified CSGO implements image-driven style transfer, text-driven stylized synthesis, and text editing-driven stylized synthesis. Extensive experiments demonstrate the effectiveness of our approach in enhancing style control capabilities in image generation. Additional visualizations and the source code can be found on the project page: this https URL.
https://arxiv.org/abs/2408.16766
Hairstyle transfer is a challenging task in the image editing field that modifies the hairstyle of a given face image while preserving its other appearance and background features. The existing hairstyle transfer approaches heavily rely on StyleGAN, which is pre-trained on cropped and aligned face images. Hence, they struggle to generalize under challenging conditions such as extreme variations of head poses or focal lengths. To address this issue, we propose a one-stage hairstyle transfer diffusion model, HairFusion, that applies to real-world scenarios. Specifically, we carefully design a hair-agnostic representation as the input of the model, where the original hair information is thoroughly eliminated. Next, we introduce a hair align cross-attention (Align-CA) to accurately align the reference hairstyle with the face image while considering the difference in their face shape. To enhance the preservation of the face image's original features, we leverage adaptive hair blending during the inference, where the output's hair regions are estimated by the cross-attention map in Align-CA and blended with non-hair areas of the face image. Our experimental results show that our method achieves state-of-the-art performance compared to the existing methods in preserving the integrity of both the transferred hairstyle and the surrounding features. The codes are available at this https URL.
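The adaptive hair blending step, as described, amounts to compositing the generated hair region onto the original image using a mask estimated from the Align-CA cross-attention map. Below is a minimal sketch under that reading; the threshold, blur-based mask softening, and all names are our assumptions rather than HairFusion's implementation.

```python
import torch
import torch.nn.functional as F

def adaptive_hair_blend(generated: torch.Tensor, original: torch.Tensor,
                        attn_map: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Composite the generated hair onto the original image outside the hair region.

    generated, original: (B, 3, H, W) images.
    attn_map: (B, 1, H, W) cross-attention response for the hair reference,
              upsampled to image resolution and normalized to [0, 1].
    """
    hair_mask = (attn_map > threshold).float()
    # Soften the mask edge with a small box blur so the seam is less visible.
    kernel = torch.ones(1, 1, 5, 5, device=hair_mask.device) / 25.0
    soft_mask = F.conv2d(hair_mask, kernel, padding=2).clamp(0, 1)
    return soft_mask * generated + (1 - soft_mask) * original
```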
https://arxiv.org/abs/2408.16450