
Break Stylistic Sophon: Are We Really Meant to Confine the Imagination in Style Transfer?

2025-06-18 00:24:29
Gary Song Yan, Yusen Zhang, Jinyu Zhao, Hao Zhang, Zhangping Yang, Guanye Xiong, Yanfei Liu, Tao Zhang, Yujie He, Siyuan Tian, Yao Gou, Min Li

Abstract

In this pioneering study, we introduce StyleWallfacer, a groundbreaking unified training and inference framework, which not only addresses various issues that traditional methods encounter in the style transfer process but also unifies the framework for different tasks. This framework is designed to revolutionize the field by enabling artist-level style transfer and text-driven stylization. First, we propose a semantic-based style injection method that uses BLIP to generate text descriptions strictly aligned with the semantics of the style image in CLIP space. By using a large language model to remove the style-related content from these descriptions, we create a semantic gap. This gap is then used to fine-tune the model, enabling efficient and drift-free injection of style knowledge. Second, we propose a data augmentation strategy based on human feedback, which incorporates high-quality samples generated early in the fine-tuning process into the training set to facilitate progressive learning and significantly reduce overfitting. Finally, we design a training-free triple diffusion process using the fine-tuned model, which manipulates the features of the self-attention layers in a manner similar to the cross-attention mechanism. Specifically, during generation, the key and value of the content-related process are replaced with those of the style-related process to inject style while maintaining text control over the model. We also introduce query preservation to mitigate disruptions to the original content. With this design, we achieve high-quality image-driven style transfer and text-driven stylization, delivering artist-level style transfer results while preserving the original image content. Moreover, we achieve image color editing during the style transfer process for the first time.
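
The key/value replacement described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head layout, and the `gamma` blending weight are all assumptions made for illustration, and the sketch covers only one attention call rather than the full triple diffusion process. The query comes from the content branch (one plausible reading of "query preservation"), while the key and value come from the style branch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention; q is (n_q, d), k and v are (n_kv, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def style_injected_attention(q_content, k_style, v_style, q_stylized=None, gamma=1.0):
    # Hypothetical helper: attend from content queries to style keys/values.
    # `gamma` is an assumed blending weight, not a parameter taken from the paper:
    # gamma=1.0 keeps the content query unchanged, smaller values mix in the
    # query of the image currently being generated.
    if q_stylized is None:
        q = q_content
    else:
        q = gamma * q_content + (1.0 - gamma) * q_stylized
    return attention(q, k_style, v_style)

# Toy usage: 4 content tokens attend to 6 style tokens of dimension 8.
rng = np.random.default_rng(0)
q_c = rng.normal(size=(4, 8))
k_s = rng.normal(size=(6, 8))
v_s = rng.normal(size=(6, 8))
out = style_injected_attention(q_c, k_s, v_s)
print(out.shape)
```

Because each output row is a softmax-weighted average of the style values, the result is built entirely from style features, while the content queries decide which style tokens each spatial location draws from.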

URL

https://arxiv.org/abs/2506.15033

PDF

https://arxiv.org/pdf/2506.15033.pdf

