Abstract
Fashion illustration is a crucial medium through which designers convey their creative vision, transforming design concepts into tangible representations that showcase the interplay between clothing and the human body. In fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research, which has focused primarily on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing: generating human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these modalities, modifying the structure of the denoising network to take multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let different cross-attention layers of the denoising network attend to textual and texture information, thus incorporating conditioning details at different granularities. Given the lack of datasets for this task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of the proposed approach in terms of realism and coherence with respect to the provided multimodal inputs.
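The core conditioning mechanism described above — cross-attention layers of the denoising network attending to text tokens concatenated with texture pseudo-tokens obtained via textual inversion — can be illustrated with a minimal single-head sketch. This is not the paper's implementation; all shapes, names, and the NumPy formulation are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, context, d_k=64, seed=0):
    """Single-head cross-attention: spatial latent tokens (queries) attend
    to conditioning tokens (keys/values), with a residual connection.
    Weights are random here purely for illustration."""
    rng = np.random.default_rng(seed)
    d_lat, d_ctx = latents.shape[-1], context.shape[-1]
    Wq = rng.standard_normal((d_lat, d_k)) / np.sqrt(d_lat)
    Wk = rng.standard_normal((d_ctx, d_k)) / np.sqrt(d_ctx)
    Wv = rng.standard_normal((d_ctx, d_lat)) / np.sqrt(d_ctx)
    Q, K, V = latents @ Wq, context @ Wk, context @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (latent_tokens, context_tokens)
    return latents + attn @ V                # residual update of the latents

# Hypothetical shapes: 16 spatial latent tokens of width 320; 77 text-encoder
# tokens plus 8 texture pseudo-tokens (as produced by textual inversion),
# concatenated into one conditioning sequence.
latents = np.random.default_rng(1).standard_normal((16, 320))
text_tokens = np.random.default_rng(2).standard_normal((77, 768))
texture_tokens = np.random.default_rng(3).standard_normal((8, 768))
context = np.concatenate([text_tokens, texture_tokens], axis=0)
out = cross_attention(latents, context)  # same shape as the input latents
```

In a real latent diffusion U-Net this block would appear at several resolutions, which is what allows different cross-attention layers to pick up conditioning details at different granularities.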
URL
https://arxiv.org/abs/2403.14828