Paper Reading AI Learner

Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On

2024-04-01 12:43:22
Xu Yang, Changxing Ding, Zhibin Hong, Junhao Huang, Jin Tao, Xiangmin Xu

Abstract

Image-based virtual try-on is an increasingly important task for online shopping. It aims to synthesize images of a specific person wearing a specified garment. Diffusion model-based approaches have recently become popular, as they excel at image synthesis tasks. However, these approaches usually employ additional image encoders and rely on the cross-attention mechanism to transfer texture from the garment to the person image, which compromises the efficiency and fidelity of the try-on. To address these issues, we propose a Texture-Preserving Diffusion (TPD) model for virtual try-on that enhances the fidelity of the results and introduces no additional image encoders. Accordingly, we make contributions in two respects. First, we propose concatenating the masked person and reference garment images along the spatial dimension and using the resulting image as the input to the diffusion model's denoising UNet. This enables the self-attention layers already contained in the diffusion model to achieve efficient and accurate texture transfer. Second, we propose a novel diffusion-based method that predicts a precise inpainting mask from the person and reference garment images, further enhancing the reliability of the try-on results. In addition, we integrate mask prediction and image synthesis into a single compact model. Experimental results show that our approach can be applied to various try-on tasks, e.g., garment-to-person and person-to-person try-on, and significantly outperforms state-of-the-art methods on the popular VITON and VITON-HD databases.
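The first contribution, spatial concatenation of the masked person and garment images as UNet input, can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the tensor shapes, the choice of the height axis as "the spatial dimension", and the latent channel count are all assumptions for the example.

```python
import torch

# Assumed latent shapes (B, C, H, W); the paper does not specify these values.
B, C, H, W = 2, 4, 64, 48
masked_person = torch.randn(B, C, H, W)   # person image with try-on region masked out
garment = torch.randn(B, C, H, W)         # reference garment image

# TPD's key idea: concatenate the two images along a spatial axis (height here,
# dim=2) rather than encoding the garment separately. The denoising UNet's
# existing self-attention layers can then attend across both regions directly,
# transferring garment texture without extra image encoders or cross-attention.
unet_input = torch.cat([masked_person, garment], dim=2)  # shape (B, C, 2H, W)
assert unet_input.shape == (B, C, 2 * H, W)
```

Because self-attention operates over all spatial tokens of its input, tokens in the masked person region can attend to garment tokens in the concatenated half of the feature map, which is what allows texture transfer without a separate encoder.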

URL

https://arxiv.org/abs/2404.01089

PDF

https://arxiv.org/pdf/2404.01089.pdf
