Abstract
In this pioneering study, we introduce StyleWallfacer, a groundbreaking unified training and inference framework that not only addresses the problems traditional methods encounter during style transfer but also brings different tasks under a single framework. The framework is designed to revolutionize the field by enabling artist-level style transfer and text-driven stylization. First, we propose a semantic-based style injection method that uses BLIP to generate text descriptions strictly aligned with the semantics of the style image in CLIP space. By leveraging a large language model to remove style-related terms from these descriptions, we create a semantic gap. This gap is then used to fine-tune the model, enabling efficient and drift-free injection of style knowledge. Second, we propose a data augmentation strategy based on human feedback, incorporating high-quality samples generated early in the fine-tuning process into the training set to facilitate progressive learning and significantly reduce overfitting. Finally, we design a training-free triple diffusion process that uses the fine-tuned model and manipulates the features of the self-attention layers in a manner similar to the cross-attention mechanism. Specifically, in the generation process, the key and value of the content-related process are replaced with those of the style-related process to inject style while maintaining text control over the model. We also introduce query preservation to mitigate disruptions to the original content. With this design, we achieve high-quality image-driven style transfer and text-driven stylization, delivering artist-level results while preserving the original image content. Moreover, we achieve, for the first time, image color editing during the style transfer process.
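To make the attention manipulation described above concrete, here is a minimal PyTorch sketch, not the released implementation. It assumes a triple diffusion setup with a style branch, a content branch, and a generation branch sharing the same self-attention layers: the generation branch's key/value are replaced with the style branch's key/value, while the query is kept close to the content branch's query ("query preservation"). The branch names, tensor shapes, and the blending weight `gamma` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F


def stylized_attention(
    q_gen: torch.Tensor,      # query of the generation branch, (B, heads, N, d)
    q_content: torch.Tensor,  # query of the content branch,    (B, heads, N, d)
    k_style: torch.Tensor,    # key of the style branch,        (B, heads, M, d)
    v_style: torch.Tensor,    # value of the style branch,      (B, heads, M, d)
    gamma: float = 0.75,      # hypothetical query-preservation strength
) -> torch.Tensor:
    # Query preservation: bias the generation query toward the content query
    # so the spatial structure of the content image is not disrupted.
    q = gamma * q_content + (1.0 - gamma) * q_gen

    # Key/value swap: the generation branch now attends to style features,
    # a cross-attention-like use of self-attention features that injects style.
    return F.scaled_dot_product_attention(q, k_style, v_style)


if __name__ == "__main__":
    B, H, N, M, d = 1, 8, 64, 64, 40
    q_g, q_c = torch.randn(B, H, N, d), torch.randn(B, H, N, d)
    k_s, v_s = torch.randn(B, H, M, d), torch.randn(B, H, M, d)
    print(stylized_attention(q_g, q_c, k_s, v_s).shape)  # torch.Size([1, 8, 64, 40])
```

In practice such a swap would be applied inside selected self-attention layers of the diffusion U-Net at each denoising step; which layers and timesteps are affected is a design choice not specified in the abstract.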
URL
https://arxiv.org/abs/2506.15033