Abstract
Human face generation and editing are essential tasks in computer vision and the digital world. Recent studies have shown remarkable progress in multi-modal face generation and editing, for instance by using face segmentation to guide image generation. However, creating these conditioning modalities manually can be challenging for some users. We therefore introduce M3Face, a unified multi-modal, multilingual framework for controllable face generation and editing. The framework lets users provide only text input: it automatically generates a controlling modality, such as a semantic segmentation or facial landmarks, and subsequently generates face images. We conduct extensive qualitative and quantitative experiments to showcase our framework's face generation and editing capabilities. Additionally, we propose the M3CelebA Dataset, a large-scale multi-modal and multilingual face dataset containing high-quality images, semantic segmentations, facial landmarks, and different captions for each image in multiple languages. The code and the dataset will be released upon publication.
URL
https://arxiv.org/abs/2402.02369