Paper Reading AI Learner

MC-VTON: Minimal Control Virtual Try-On Diffusion Transformer

2025-01-07 09:00:07
Junsheng Luan, Guangyuan Li, Lei Zhao, Wei Xing

Abstract

Virtual try-on methods based on diffusion models achieve realistic try-on effects, but they use an extra reference network or an additional image encoder to process multiple conditional image inputs, which results in high training costs. They also require more than 25 inference steps, leading to long inference times. In this work, motivated by the development of the diffusion transformer (DiT), we rethink the necessity of the reference network and image encoder and propose MC-VTON, which enables a DiT to integrate minimal conditional try-on inputs using its intrinsic backbone. Compared to existing methods, MC-VTON is superior in four aspects: (1) Superior detail fidelity. Our DiT-based MC-VTON preserves fine-grained details with higher fidelity. (2) Simplified network and inputs. We remove the extra reference network and image encoder, along with unnecessary conditions such as the long prompt, pose estimation, human parsing, and depth map; only the masked person image and the garment image are required. (3) Parameter-efficient training. For the try-on task, we fine-tune FLUX.1-dev with only 39.7M additional parameters (0.33% of the backbone parameters). (4) Fewer inference steps. We apply diffusion distillation to MC-VTON and need only 8 steps to generate a realistic try-on image, with only 86.8M additional parameters (0.72% of the backbone parameters). Experiments show that MC-VTON achieves superior qualitative and quantitative results with fewer condition inputs, fewer inference steps, and fewer trainable parameters than baseline methods.
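The core idea in the abstract — conditioning the DiT by feeding the masked person image and the garment image through the model's own backbone as extra tokens, rather than through a separate reference network or image encoder — can be illustrated with a toy single-head attention sketch. Everything here (dimensions, token counts, the LoRA-style low-rank adapter) is an illustrative assumption, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy token dimension

def self_attention(x, Wq, Wk, Wv):
    # Single-head self-attention over the full token sequence.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Toy token sequences standing in for patchified image latents.
noise_tokens   = rng.standard_normal((64, d))  # noisy try-on latent
person_tokens  = rng.standard_normal((64, d))  # masked person image
garment_tokens = rng.standard_normal((64, d))  # garment image

# Frozen backbone weights plus a small low-rank (LoRA-style) update;
# only A and B would be trained, mirroring the small trainable ratio
# (0.33% of backbone parameters) reported in the abstract.
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))
r = 2  # low rank of the adapter
A, B = rng.standard_normal((d, r)), np.zeros((r, d))
Wv_adapted = Wv + A @ B  # B starts at zero, so the backbone is unchanged at init

# "Minimal control": the two conditions are just extra tokens in the same
# sequence; self-attention lets the noise tokens read them directly,
# with no reference network and no extra image encoder.
seq = np.concatenate([noise_tokens, person_tokens, garment_tokens], axis=0)
out = self_attention(seq, Wq, Wk, Wv_adapted)
denoised = out[:64]  # keep only the try-on latent tokens
print(denoised.shape)  # (64, 16)
```

Concatenating condition tokens into the backbone's own sequence is what makes the extra networks unnecessary in this sketch: attention already gives every noisy latent token access to every person and garment token.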

URL

https://arxiv.org/abs/2501.03630

PDF

https://arxiv.org/pdf/2501.03630.pdf
