Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On

2025-05-22 17:52:13
Siqi Wan, Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei

Abstract

Diffusion models have shown preliminary success in the virtual try-on (VTON) task. The typical dual-branch architecture, which comprises two UNets for implicit garment deformation and synthesized image generation respectively, has emerged as the standard recipe for VTON. Nevertheless, preserving the shape and every detail of the given garment remains challenging due to the intrinsic stochasticity of diffusion models. To alleviate this issue, we propose to explicitly capitalize on visual correspondence as a prior to tame the diffusion process, instead of simply feeding the whole garment into the UNet as an appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in the garment to the ones over the target person through local flow warping. These 2D points are then augmented into 3D-aware cues with the depth/normal map of the target person. The correspondence mimics the way clothing is put on the human body, and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to take full advantage of semantic point matching. Extensive experiments demonstrate strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performance on both the VITON-HD and DressCode datasets. Code is publicly available at: this https URL.
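
The abstract does not give the formula for the point-focused diffusion loss, so the following is only an illustrative sketch of the general idea: a standard noise-prediction mean squared error in which pixels near matched semantic points receive a larger weight. The function name `point_focused_loss` and the parameters `radius` and `point_weight` are hypothetical and are not taken from the paper or its code.

```python
import numpy as np

def point_focused_loss(eps_pred, eps_true, points, radius=1, point_weight=5.0):
    """Weighted MSE between predicted and true noise (illustrative only).

    eps_pred, eps_true: arrays of shape (C, H, W).
    points: iterable of (row, col) integer coordinates of semantic points.
    radius, point_weight: hypothetical knobs, not from the paper.
    """
    channels, height, width = eps_true.shape
    # Every pixel starts with weight 1; windows around points are up-weighted.
    weights = np.ones((height, width), dtype=np.float64)
    for row, col in points:
        r0, r1 = max(row - radius, 0), min(row + radius + 1, height)
        c0, c1 = max(col - radius, 0), min(col + radius + 1, width)
        weights[r0:r1, c0:c1] = point_weight
    sq_err = (eps_pred - eps_true) ** 2
    # Normalize by the total weight so the result stays a weighted mean.
    return float((sq_err * weights[None]).sum() / (weights.sum() * channels))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eps_true = rng.standard_normal((4, 8, 8))
    eps_pred = eps_true + 0.1 * rng.standard_normal((4, 8, 8))
    print(point_focused_loss(eps_pred, eps_true, points=[(2, 3), (5, 5)]))
```

With an empty `points` list the function reduces to the ordinary mean squared error, which makes the effect of the extra weighting easy to compare against a plain diffusion loss.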

URL

https://arxiv.org/abs/2505.16977

PDF

https://arxiv.org/pdf/2505.16977.pdf

