Abstract
Text-to-image generative models can generate high-quality humans, but realism is lost when generating hands. Common artifacts include irregular hand poses and shapes, incorrect numbers of fingers, and physically implausible finger orientations. To generate images with realistic hands, we propose a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings into the generative process. HanDiffuser consists of two components: a Text-to-Hand-Params diffusion model that generates SMPL-Body and MANO-Hand parameters from input text prompts, and a Text-Guided Hand-Params-to-Image diffusion model that synthesizes images by conditioning on the prompts and the hand parameters generated by the first component. We incorporate multiple aspects of hand representation, including 3D shapes and joint-level finger positions, orientations, and articulations, for robust learning and reliable performance during inference. We conduct extensive quantitative and qualitative experiments and perform user studies to demonstrate the efficacy of our method in generating images with high-quality hands.
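The two-stage architecture described above can be sketched as follows. This is a minimal, illustrative skeleton, not the authors' implementation: both diffusion models are stubbed with random placeholders, and the function names, parameter dimensionalities (72 for SMPL body pose, 45 per MANO hand), and return shapes are assumptions made only to show the data flow from text prompt to hand parameters to image.

```python
import random

# Assumed parameter dimensionalities (illustrative, not from the paper):
SMPL_BODY_DIM = 72   # SMPL body pose parameters (axis-angle)
MANO_HAND_DIM = 45   # MANO articulation parameters per hand

def text_to_hand_params(prompt, rng):
    """Stage 1 stub: Text-to-Hand-Params diffusion model.
    Maps a text prompt to SMPL-Body and MANO-Hand parameters.
    Here the denoising process is replaced by random samples."""
    return {
        "smpl_body": [rng.gauss(0, 1) for _ in range(SMPL_BODY_DIM)],
        "mano_left": [rng.gauss(0, 1) for _ in range(MANO_HAND_DIM)],
        "mano_right": [rng.gauss(0, 1) for _ in range(MANO_HAND_DIM)],
    }

def hand_params_to_image(prompt, hand_params, rng, size=64):
    """Stage 2 stub: Text-Guided Hand-Params-to-Image diffusion model.
    Conditions on both the prompt and the stage-1 hand parameters;
    here it just returns a random size x size x 3 'image'."""
    return [[[rng.random() for _ in range(3)]
             for _ in range(size)]
            for _ in range(size)]

def handiffuser(prompt, seed=0):
    """Run the full two-stage pipeline: text -> hand params -> image."""
    rng = random.Random(seed)
    params = text_to_hand_params(prompt, rng)          # stage 1
    return hand_params_to_image(prompt, params, rng)   # stage 2

image = handiffuser("a person waving with both hands")
```

The key design point the sketch illustrates is that the image generator never sees raw text alone: it is always conditioned on explicit 3D hand parameters produced by the first stage, which is what allows the method to enforce plausible finger counts and orientations.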
URL
https://arxiv.org/abs/2403.01693