Paper Reading AI Learner

Towards Consistent and Controllable Image Synthesis for Face Editing

2025-02-04 16:36:07
Mengting Wei, Tuomas Varanka, Yante Li, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao

Abstract

Current face editing methods mainly rely on GAN-based techniques, but recent focus has shifted to diffusion-based models due to their success in image reconstruction. However, diffusion models still struggle to manipulate fine-grained attributes and to preserve the consistency of attributes that should remain unchanged. To address these issues and facilitate more convenient editing of face images, we propose a novel approach that leverages the power of Stable-Diffusion models and crude 3D face models to control the lighting, facial expression and head pose of a portrait photo. We observe that this task essentially involves combinations of target background, identity and different face attributes. We aim to sufficiently disentangle the control of these factors to enable high-quality face editing. Specifically, our method, coined RigFace, contains: 1) a Spatial Attribute Encoder that provides precise and decoupled conditions of background, pose, expression and lighting; 2) an Identity Encoder that transfers identity features to the denoising UNet of a pre-trained Stable-Diffusion model; and 3) an Attribute Rigger that injects those conditions into the denoising UNet. Our model achieves comparable or even superior performance in both identity preservation and photorealism compared to existing face editing models.
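The three-component design above can be illustrated with a minimal toy sketch. This is not the authors' implementation: the function names, tensor shapes, pooling, and residual injection are all simplifying assumptions standing in for the real encoders and the Stable-Diffusion UNet, showing only how decoupled condition maps and identity features would flow into a denoising step.

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_attribute_encoder(background, pose, expression, lighting):
    """Hypothetical stand-in: stack the four per-pixel condition maps
    into one decoupled spatial condition tensor of shape (4, H, W)."""
    return np.stack([background, pose, expression, lighting], axis=0)

def identity_encoder(face_image):
    """Hypothetical stand-in: global-average-pool a (C, H, W) face image
    into an identity feature vector that would feed the denoising UNet."""
    return face_image.mean(axis=(1, 2))  # shape (C,)

def attribute_rigger(unet_features, condition):
    """Hypothetical stand-in: inject the spatial conditions into UNet
    feature maps by collapsing condition channels and adding a residual."""
    proj = condition.mean(axis=0, keepdims=True)  # (1, H, W)
    return unet_features + proj                   # residual injection

# Toy inputs (real inputs would be renders from a crude 3D face model).
H, W = 8, 8
background = rng.random((H, W))
pose       = rng.random((H, W))
expression = rng.random((H, W))
lighting   = rng.random((H, W))
face       = rng.random((3, H, W))

cond   = spatial_attribute_encoder(background, pose, expression, lighting)
ident  = identity_encoder(face)
feats  = rng.random((1, H, W))        # stand-in for one UNet feature map
rigged = attribute_rigger(feats, cond)

print(cond.shape, ident.shape, rigged.shape)  # (4, 8, 8) (3,) (1, 8, 8)
```

In the actual model, each of the four conditions can be edited independently before encoding, which is what makes the control disentangled: changing the lighting map leaves the pose, expression, and background conditions untouched.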


URL

https://arxiv.org/abs/2502.02465

PDF

https://arxiv.org/pdf/2502.02465.pdf

