Paper Reading AI Learner

Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing

2024-03-21 20:43:10
Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara

Abstract

Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these multiple modalities, modifying the structure of the denoising network so that it takes the multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let different cross-attention layers of the denoising network attend to textual and texture information, thus incorporating conditioning details at different levels of granularity. Given the lack of datasets suited to the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence with respect to the provided multimodal inputs.
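
The page does not include code, so the following is a minimal PyTorch sketch of the conditioning mechanism the abstract describes: a denoising-network block whose cross-attention attends jointly to text-prompt embeddings and to texture pseudo-word embeddings obtained via textual inversion. The class and argument names (MultimodalCrossAttention, text_tokens, texture_tokens) and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of cross-attention over
# multimodal conditioning in a latent diffusion denoising network.
import torch
import torch.nn as nn

class MultimodalCrossAttention(nn.Module):
    """Cross-attention from spatial latents to concatenated text + texture tokens."""
    def __init__(self, latent_dim: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=num_heads,
            kdim=cond_dim, vdim=cond_dim, batch_first=True,
        )
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents, text_tokens, texture_tokens):
        # latents:        (B, HW, latent_dim) flattened spatial latent features
        # text_tokens:    (B, T, cond_dim)   CLIP-style text embeddings
        # texture_tokens: (B, K, cond_dim)   pseudo-word embeddings from textual inversion
        cond = torch.cat([text_tokens, texture_tokens], dim=1)
        out, _ = self.attn(self.norm(latents), cond, cond)
        return latents + out  # residual connection around the attention block

# Toy usage with random tensors, just to show the shapes involved.
block = MultimodalCrossAttention(latent_dim=320, cond_dim=768)
latents = torch.randn(2, 64 * 64, 320)      # 64x64 latent grid
text = torch.randn(2, 77, 768)              # tokenized text prompt
texture = torch.randn(2, 4, 768)            # inverted texture embedding
print(block(latents, text, texture).shape)  # torch.Size([2, 4096, 320])
```

In the full model, spatial conditions such as pose maps and garment sketches would more naturally be injected as extra input channels to the denoising network, while token-like conditions (text and inverted textures) enter through cross-attention as sketched here; this split is an assumption consistent with the abstract, not a confirmed design detail.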

URL

https://arxiv.org/abs/2403.14828

PDF

https://arxiv.org/pdf/2403.14828.pdf

