M$^3$Face: A Unified Multi-Modal Multilingual Framework for Human Face Generation and Editing

2024-02-04 06:56:23
Mohammadreza Mofayezi, Reza Alipour, Mohammad Ali Kakavand, Ehsaneddin Asgari

Abstract

Human face generation and editing represent an essential task in the era of computer vision and the digital world. Recent studies have shown remarkable progress in multi-modal face generation and editing, for instance, using face segmentation to guide image generation. However, it may be challenging for some users to create these conditioning modalities manually. Thus, we introduce M3Face, a unified multi-modal multilingual framework for controllable face generation and editing. This framework enables users to utilize only text input to generate controlling modalities automatically, for instance, semantic segmentation or facial landmarks, and subsequently generate face images. We conduct extensive qualitative and quantitative experiments to showcase our framework's face generation and editing capabilities. Additionally, we propose the M3CelebA Dataset, a large-scale multi-modal and multilingual face dataset containing high-quality images, semantic segmentations, facial landmarks, and different captions for each image in multiple languages. The code and the dataset will be released upon publication.

URL

https://arxiv.org/abs/2402.02369

PDF

https://arxiv.org/pdf/2402.02369.pdf
