Abstract
Virtual try-on methods based on diffusion models achieve realistic try-on effects, but they rely on an extra reference network or an additional image encoder to process multiple conditional image inputs, which results in high training costs. Besides, they require more than 25 inference steps, leading to long inference times. In this work, motivated by the development of the diffusion transformer (DiT), we rethink the necessity of the reference network and image encoder, and propose MC-VTON, which enables a DiT to integrate minimal conditional try-on inputs using its intrinsic backbone. Compared with existing methods, MC-VTON is superior in four aspects: (1) Superior detail fidelity. Our DiT-based MC-VTON exhibits superior fidelity in preserving fine-grained details. (2) Simplified network and inputs. We remove any extra reference network or image encoder, as well as unnecessary conditions such as the long prompt, pose estimation, human parsing, and depth map; only the masked person image and the garment image are required. (3) Parameter-efficient training. To handle the try-on task, we fine-tune FLUX.1-dev with only 39.7M additional parameters (0.33% of the backbone parameters). (4) Fewer inference steps. We apply diffusion distillation to MC-VTON, so only 8 steps are needed to generate a realistic try-on image, with only 86.8M additional parameters (0.72% of the backbone parameters). Experiments show that MC-VTON achieves superior qualitative and quantitative results with fewer conditional inputs, fewer inference steps, and fewer trainable parameters than baseline methods.
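The abstract reports only parameter counts, not the adapter design, so the following is a hedged sketch: it assumes a LoRA-style low-rank update on a frozen backbone weight (one common way to add a sub-1% trainable-parameter budget) and checks the reported fractions against the commonly cited ~12B-parameter size of the FLUX.1-dev backbone. The layer width and rank are hypothetical.

```python
import numpy as np

# Sketch only: the paper's exact adapter design is not given in the abstract.
# Assume a LoRA-style correction W_eff = W + B @ A on a frozen weight W.
rng = np.random.default_rng(0)

d_out, d_in, rank = 3072, 3072, 16            # hypothetical DiT layer size and rank
W = rng.standard_normal((d_out, d_in))        # frozen backbone weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def forward(x):
    # Adapter adds a low-rank correction to the frozen projection;
    # with zero-initialized B the model starts identical to the backbone.
    return W @ x + B @ (A @ x)

adapter_params = A.size + B.size
backbone_params = W.size
print(f"adapter params for this layer: {adapter_params}")
print(f"fraction of this layer:        {adapter_params / backbone_params:.2%}")

# Sanity check against the abstract's totals, assuming a ~12B backbone:
print(f"39.7M / 12B = {39.7e6 / 12e9:.2%}")   # ~0.33%, as reported for try-on
print(f"86.8M / 12B = {86.8e6 / 12e9:.2%}")   # ~0.72%, as reported after distillation
```

Zero-initializing the up-projection is the usual LoRA trick that keeps the fine-tuned model's initial outputs identical to the frozen backbone's, which stabilizes adapter training.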
URL
https://arxiv.org/abs/2501.03630