Implicit Style-Content Separation using B-LoRA

2024-03-21 17:20:21
Yarden Frenkel, Yael Vinker, Ariel Shamir, Daniel Cohen-Or

Abstract

Image stylization involves manipulating the visual appearance and texture (style) of an image while preserving its underlying objects, structures, and concepts (content). The separation of style and content is essential for manipulating the image's style independently from its content, ensuring a harmonious and visually pleasing result. Achieving this separation requires a deep understanding of both the visual and semantic characteristics of images, often necessitating the training of specialized models or employing heavy optimization. In this paper, we introduce B-LoRA, a method that leverages LoRA (Low-Rank Adaptation) to implicitly separate the style and content components of a single image, facilitating various image stylization tasks. By analyzing the architecture of SDXL combined with LoRA, we find that jointly learning the LoRA weights of two specific blocks (referred to as B-LoRAs) achieves style-content separation that cannot be achieved by training each B-LoRA independently. Consolidating the training into only two blocks and separating style and content allows for significantly improving style manipulation and overcoming overfitting issues often associated with model fine-tuning. Once trained, the two B-LoRAs can be used as independent components to allow various image stylization tasks, including image style transfer, text-based image stylization, consistent style generation, and style-content mixing.
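
Example (sketch)

The abstract's recipe, training LoRA weights jointly but keeping only two SDXL UNet blocks as independent style and content components, can be approximated with the LoRA utilities in diffusers. The sketch below is an unofficial illustration, not the authors' code: the checkpoint paths and prompt tokens are hypothetical, the block names (up_blocks.0.attentions.0 for content, up_blocks.0.attentions.1 for style) follow the paper's reported analysis of SDXL, and the lora_state_dict / load_lora_into_unet helpers come from diffusers' LoRA loader mixin, whose exact signatures vary across versions.

```python
# Unofficial sketch: combine the "content" B-LoRA of one image with the
# "style" B-LoRA of another at inference time, assuming two checkpoints
# were trained as described in the paper. Paths and prompt tokens are
# hypothetical; block names follow the paper's analysis of the SDXL UNet.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# The paper reports that, within SDXL, these two transformer blocks
# capture the content and the style of the training image, respectively.
CONTENT_BLOCK = "unet.up_blocks.0.attentions.0"
STYLE_BLOCK = "unet.up_blocks.0.attentions.1"

def block_weights(lora_path: str, block_prefix: str) -> dict:
    """Load a LoRA checkpoint and keep only the keys of one UNet block."""
    state_dict, _ = pipe.lora_state_dict(lora_path)  # diffusers LoRA loader
    return {k: v for k, v in state_dict.items() if k.startswith(block_prefix)}

# Hypothetical B-LoRA checkpoints trained on two reference images.
content_lora = block_weights("dog_blora.safetensors", CONTENT_BLOCK)
style_lora = block_weights("watercolor_blora.safetensors", STYLE_BLOCK)

# Inject the two independent components and generate a stylized image.
pipe.load_lora_into_unet({**content_lora, **style_lora}, None, pipe.unet)
image = pipe("A [v] dog in [s] style").images[0]
image.save("stylized.png")
```

Loading only the style block (and no content block) would give text-based stylization of arbitrary prompts, while swapping either checkpoint mixes the style of one reference image with the content of another, matching the tasks listed in the abstract.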

URL

https://arxiv.org/abs/2403.14572

PDF

https://arxiv.org/pdf/2403.14572.pdf

