Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction

2025-05-22 17:52:34
Dong Li, Wenqi Zhong, Wei Yu, Yingwei Pan, Dingwen Zhang, Ting Yao, Junwei Han, Tao Mei

Abstract

Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. The primary challenge is preserving the visual authenticity of the garment while dynamically adapting it to the pose and physique of the subject. Existing methods have predominantly focused on image-based virtual try-on, and extending these techniques directly to videos often results in temporal inconsistencies. Most current video virtual try-on approaches alleviate this problem by incorporating temporal modules, yet they still overlook the critical spatiotemporal pose interactions between human and garment. Effective pose interactions in videos should consider not only the spatial alignment between human and garment poses in each frame but also the temporal dynamics of human poses throughout the entire video. Motivated by this, we propose a new framework, Dynamic Pose Interaction Diffusion Models (DPIDM), which leverages diffusion models to explore dynamic pose interactions for video virtual try-on. Technically, DPIDM introduces a skeleton-based pose adapter to integrate synchronized human and garment poses into the denoising network. A hierarchical attention module is then designed to model intra-frame human-garment pose interactions and long-term human pose dynamics across frames through pose-aware spatial and temporal attention mechanisms. Moreover, DPIDM applies a temporal regularized attention loss between consecutive frames to enhance temporal consistency. Extensive experiments on the VITON-HD, VVT, and ViViD datasets demonstrate the superiority of DPIDM over the baseline methods. Notably, DPIDM achieves a VFID score of 0.506 on the VVT dataset, a 60.5% improvement over the state-of-the-art GPD-VVTO approach.
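The abstract does not give the form of the temporal regularized attention loss, so the following is only a minimal sketch of the general idea: penalizing changes in attention maps between consecutive frames. The function name `temporal_attention_loss`, the mean-squared-difference penalty, and the toy 2x2 attention maps are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List

# Hypothetical representation: one frame's attention weights as a rows x cols grid.
AttentionMap = List[List[float]]


def temporal_attention_loss(maps: List[AttentionMap]) -> float:
    """Mean squared difference between attention maps of consecutive frames.

    This is an assumed stand-in for a temporal regularizer, not the
    formulation used by DPIDM.
    """
    if len(maps) < 2:
        return 0.0  # a single frame has no temporal neighbor to compare with
    total = 0.0
    count = 0
    # Pair each frame with the one that follows it.
    for prev, curr in zip(maps, maps[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                total += (a - b) ** 2
                count += 1
    return total / count if count else 0.0


# Toy inputs: attention that stays fixed versus attention that flips each frame.
steady = [[[0.5, 0.5], [0.5, 0.5]]] * 3
flicker = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]

print(temporal_attention_loss(steady))
print(temporal_attention_loss(flicker))
```

Under this assumed penalty, attention that is identical across frames contributes nothing to the loss, while attention that flickers between frames is penalized, which is the behavior a temporal consistency term is meant to encourage.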

URL

https://arxiv.org/abs/2505.16980

PDF

https://arxiv.org/pdf/2505.16980.pdf

