Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. In this paper, we propose DS-VTON, a dual-scale virtual try-on framework that explicitly disentangles these objectives for more effective modeling. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. The second stage introduces a residual-guided diffusion process that reconstructs high-resolution outputs by refining the residual between the two scales, focusing on texture fidelity. In addition, our method adopts a fully mask-free generation paradigm, eliminating reliance on human parsing maps or segmentation masks. By leveraging the semantic priors embedded in pretrained diffusion models, this design more effectively preserves the person's appearance and geometric consistency. Extensive experiments demonstrate that DS-VTON achieves state-of-the-art performance in both structural alignment and texture preservation across multiple standard virtual try-on benchmarks.
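Since the abstract only outlines the two-stage design, the minimal data-flow sketch below may help; the stage modules (`low_res_tryon`, `residual_refiner`) are hypothetical single-layer stand-ins for the conditional diffusion stages, and the exact residual formulation is an assumption rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the two diffusion stages; the real models are
# conditional diffusion networks, not single conv layers.
low_res_tryon = nn.Conv2d(6, 3, 3, padding=1)     # stage 1: person + garment -> coarse try-on
residual_refiner = nn.Conv2d(6, 3, 3, padding=1)  # stage 2: coarse + garment -> residual

person  = torch.rand(1, 3, 1024, 768)   # target person image
garment = torch.rand(1, 3, 1024, 768)   # garment image

# Stage 1: work at a reduced scale so structural alignment dominates.
lr = F.interpolate(torch.cat([person, garment], dim=1), scale_factor=0.25, mode="bilinear")
coarse = low_res_tryon(lr)                                   # (1, 3, 256, 192)

# Stage 2: refine the residual between the two scales for texture fidelity.
coarse_up = F.interpolate(coarse, size=person.shape[-2:], mode="bilinear")
residual = residual_refiner(torch.cat([coarse_up, garment], dim=1))
result = coarse_up + residual                                # high-resolution try-on
print(result.shape)  # torch.Size([1, 3, 1024, 768])
```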
https://arxiv.org/abs/2506.00908
Human-centric perception is the core of diverse computer vision tasks and has been a long-standing research focus. However, previous research studied these human-centric tasks individually, so performance is largely limited by the size of the public task-specific datasets. Recent human-centric methods leverage additional modalities, e.g., depth, to learn fine-grained semantic information, but this limits the benefit of pretraining models due to their sensitivity to camera views and the scarcity of RGB-D data on the Internet. This paper improves the data scalability of human-centric pretraining methods by discarding depth information and exploring the semantic information of RGB images in the frequency space via the Discrete Cosine Transform (DCT). We further propose new annotation-denoising auxiliary tasks with keypoints and DCT maps to enforce the RGB image extractor to learn fine-grained semantic information of human bodies. Our extensive experiments show that when pretrained on large-scale datasets (COCO and AIC) without depth annotation, our model outperforms state-of-the-art methods by +0.5 mAP on COCO, +1.4 PCKh on MPII and -0.51 EPE on Human3.6M for pose estimation, by +4.50 mIoU on Human3.6M for human parsing, by -3.14 MAE on SHA and -0.07 MAE on SHB for crowd counting, by +1.1 F1 score on SHA and +0.8 F1 score on SHB for crowd localization, and by +0.1 mAP on Market1501 and +0.8 mAP on MSMT for person ReID. We also validate the effectiveness of our method on the MPII+NTURGBD datasets.
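A small sketch of turning an RGB image into block-wise DCT maps, the frequency-space signal the paper exploits instead of depth; the 8x8 block size and per-channel handling are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dctn

def blockwise_dct_map(img: np.ndarray, block: int = 8) -> np.ndarray:
    """Per-channel 2-D DCT over non-overlapping blocks (JPEG-style layout)."""
    h, w, c = img.shape
    h, w = h - h % block, w - w % block          # crop to a multiple of the block size
    out = np.empty((h, w, c), dtype=np.float32)
    for ch in range(c):
        for y in range(0, h, block):
            for x in range(0, w, block):
                out[y:y+block, x:x+block, ch] = dctn(
                    img[y:y+block, x:x+block, ch].astype(np.float32), norm="ortho")
    return out

rgb = np.random.rand(256, 192, 3).astype(np.float32)  # placeholder for an RGB crop
dct_map = blockwise_dct_map(rgb)
print(dct_map.shape)  # (256, 192, 3)
```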
https://arxiv.org/abs/2504.20800
Visible-Infrared Person Re-identification (VIReID) aims to match visible and infrared pedestrian images, but the modality differences and the complexity of identity features make it challenging. Existing methods rely solely on identity label supervision, which makes it difficult to fully extract high-level semantic information. Recently, vision-language pre-trained models have been introduced to VIReID, enhancing semantic information modeling by generating textual descriptions. However, such methods do not explicitly model body shape features, which are crucial for cross-modal matching. To address this, we propose an effective Body Shape-aware Textual Alignment (BSaTa) framework that explicitly models and utilizes body shape information to improve VIReID performance. Specifically, we design a Body Shape Textual Alignment (BSTA) module that extracts body shape information using a human parsing model and converts it into structured text representations via CLIP. We also design a Text-Visual Consistency Regularizer (TVCR) to ensure alignment between body shape textual representations and visual body shape features. Furthermore, we introduce a Shape-aware Representation Learning (SRL) mechanism that combines Multi-text Supervision and Distribution Consistency Constraints to guide the visual encoder to learn modality-invariant and discriminative identity features, thus enhancing modality invariance. Experimental results demonstrate that our method achieves superior performance on the SYSU-MM01 and RegDB datasets, validating its effectiveness.
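The Text-Visual Consistency Regularizer is described only at a high level; below is a hedged sketch of one plausible form, a symmetric contrastive alignment between body-shape text embeddings and visual body-shape features. The InfoNCE-style formulation and temperature are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def text_visual_consistency(shape_text_emb: torch.Tensor,
                            shape_vis_feat: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Pull matched (text, visual) body-shape pairs together, push others apart."""
    t = F.normalize(shape_text_emb, dim=-1)          # (B, D), e.g. CLIP text embeddings
    v = F.normalize(shape_vis_feat, dim=-1)          # (B, D), visual body-shape features
    logits = t @ v.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = text_visual_consistency(torch.randn(8, 512), torch.randn(8, 512))
```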
https://arxiv.org/abs/2504.18025
Vision Foundation Models (VFMs) such as the Segment Anything Model (SAM) and the Contrastive Language-Image Pre-training Model (CLIP) have shown promising performance for segmentation and detection tasks. However, although SAM excels at fine-grained segmentation, it faces major challenges when applied to semantic-aware segmentation. While CLIP exhibits strong semantic understanding by aligning the global features of language and vision, it falls short on fine-grained segmentation tasks. Human parsing requires segmenting human bodies into constituent parts and involves both accurate fine-grained segmentation and a high semantic understanding of each part. Based on the traits of SAM and CLIP, we formulate highly efficient modules that effectively integrate their features to benefit human parsing. We propose a Semantic-Refinement Module that integrates CLIP's semantic features with SAM features to benefit parsing. Moreover, we formulate a highly efficient Fine-tuning Module to adapt the pretrained SAM to human parsing, which needs high semantic information and simultaneously demands spatial details; this significantly reduces training time compared with full training while achieving notable performance. Extensive experiments demonstrate the effectiveness of our method on the LIP, PPP, and CIHP databases.
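A minimal sketch of how a semantic-refinement module could inject CLIP's global semantics into SAM's dense features; the channel sizes and the FiLM-style modulation are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SemanticRefinement(nn.Module):
    """Modulate dense SAM features with a global CLIP embedding (FiLM-style)."""
    def __init__(self, sam_dim: int = 256, clip_dim: int = 512):
        super().__init__()
        self.to_scale = nn.Linear(clip_dim, sam_dim)
        self.to_shift = nn.Linear(clip_dim, sam_dim)
        self.refine = nn.Conv2d(sam_dim, sam_dim, 3, padding=1)

    def forward(self, sam_feat: torch.Tensor, clip_emb: torch.Tensor) -> torch.Tensor:
        # sam_feat: (B, C, H, W) dense features; clip_emb: (B, D) global semantics
        scale = self.to_scale(clip_emb)[..., None, None]
        shift = self.to_shift(clip_emb)[..., None, None]
        return self.refine(sam_feat * (1 + scale) + shift)

fused = SemanticRefinement()(torch.randn(2, 256, 64, 64), torch.randn(2, 512))
print(fused.shape)  # torch.Size([2, 256, 64, 64])
```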
https://arxiv.org/abs/2503.22237
Existing image-based virtual try-on methods directly transfer specific clothing to a human image without utilizing clothing attributes to refine the transferred clothing geometry and textures, which causes incomplete and blurred clothing appearances. In addition, these methods usually mask the limb textures of the input for the clothing-agnostic person representation, which results in inaccurate predictions for human limb regions (i.e., the exposed arm skin), especially when transforming between long-sleeved and short-sleeved garments. To address these problems, we present a progressive virtual try-on framework, named PL-VTON, which performs pixel-level clothing warping based on multiple attributes of clothing and embeds explicit limb-aware features to generate photo-realistic try-on results. Specifically, we design a Multi-attribute Clothing Warping (MCW) module that adopts a two-stage alignment strategy based on multiple attributes to progressively estimate pixel-level clothing displacements. A Human Parsing Estimator (HPE) is then introduced to semantically divide the person into various regions, which provides structural constraints on the human body and therefore alleviates texture bleeding between clothing and limb regions. Finally, we propose a Limb-aware Texture Fusion (LTF) module to estimate high-quality details in limb regions by fusing textures of the clothing and the human body with the guidance of explicit limb-aware features. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art virtual try-on methods both qualitatively and quantitatively. The code is available at this https URL.
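Pixel-level clothing warping of the kind MCW performs can be expressed with a dense displacement field and `grid_sample`; the flow field below is a zero placeholder, since the paper estimates it progressively from multiple clothing attributes.

```python
import torch
import torch.nn.functional as F

def warp_clothing(cloth: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a garment image with per-pixel displacements (in pixels)."""
    b, _, h, w = cloth.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).float().expand(b, h, w, 2)
    coords = base + flow.permute(0, 2, 3, 1)              # displaced pixel coordinates
    # normalise to [-1, 1] for grid_sample
    coords[..., 0] = 2 * coords[..., 0] / (w - 1) - 1
    coords[..., 1] = 2 * coords[..., 1] / (h - 1) - 1
    return F.grid_sample(cloth, coords, mode="bilinear", align_corners=True)

cloth = torch.rand(1, 3, 256, 192)
flow = torch.zeros(1, 2, 256, 192)       # placeholder: MCW would predict this field
print(warp_clothing(cloth, flow).shape)  # torch.Size([1, 3, 256, 192])
```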
https://arxiv.org/abs/2503.12588
Gait recognition has emerged as a robust biometric modality due to its non-intrusive nature and resilience to occlusion. Conventional gait recognition methods typically rely on silhouettes or skeletons. Despite their success in controlled laboratory environments, they usually fail in real-world scenarios because of the limited information entropy of their gait representations. To achieve accurate gait recognition in the wild, we propose a novel gait representation, named Parsing Skeleton. This representation introduces a skeleton-guided human parsing method to capture fine-grained body dynamics, so it carries much higher information entropy for encoding the shapes and dynamics of fine-grained human parts during walking. Moreover, to effectively explore the capability of the parsing skeleton representation, we propose a novel parsing skeleton-based gait recognition framework, named PSGait, which takes parsing skeletons and silhouettes as input. By fusing these two modalities, the resulting image sequences are fed into gait recognition models for enhanced individual differentiation. We conduct comprehensive benchmarks on various datasets to evaluate our model. PSGait outperforms existing state-of-the-art multimodal methods. Furthermore, as a plug-and-play method, PSGait leads to a maximum improvement of 10.9% in Rank-1 accuracy across various gait recognition models. These results demonstrate the effectiveness and versatility of parsing skeletons for gait recognition in the wild, establishing PSGait as a new state-of-the-art approach for multimodal gait recognition.
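Below is a hedged sketch of the modality fusion PSGait describes, packing the parsing-skeleton map and the binary silhouette into one frame that a silhouette-style gait backbone can consume; the exact packing (separate channels here) and the frame size are assumptions.

```python
import numpy as np

def fuse_frame(parsing_skeleton: np.ndarray, silhouette: np.ndarray) -> np.ndarray:
    """Stack a part-labelled parsing-skeleton map and a binary silhouette.

    parsing_skeleton: (H, W) uint8 with per-part labels (0 = background)
    silhouette:       (H, W) uint8 binary mask
    Returns an (H, W, 2) frame usable as a 2-channel gait input.
    """
    return np.stack([parsing_skeleton, silhouette * 255], axis=-1).astype(np.uint8)

seq = [fuse_frame(np.random.randint(0, 20, (64, 44), dtype=np.uint8),
                  np.random.randint(0, 2, (64, 44), dtype=np.uint8))
       for _ in range(30)]                      # a 30-frame fused gait sequence
print(np.stack(seq).shape)  # (30, 64, 44, 2)
```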
https://arxiv.org/abs/2503.12047
Virtual try-on methods based on diffusion models achieve realistic try-on effects. They use an extra reference network or an additional image encoder to process multiple conditional image inputs, which results in high training costs. Besides, they require more than 25 inference steps, bringing long inference times. In this work, with the development of the diffusion transformer (DiT), we rethink the necessity of the reference network or image encoder and propose MC-VTON, which enables DiT to integrate minimal conditional try-on inputs by utilizing its intrinsic backbone. Compared to existing methods, the superiority of MC-VTON is demonstrated in four aspects: (1) Superior detail fidelity. Our DiT-based MC-VTON exhibits superior fidelity in preserving fine-grained details. (2) Simplified network and inputs. We remove any extra reference network or image encoder. We also remove unnecessary conditions such as the long prompt, pose estimation, human parsing, and depth map; we require only the masked person image and the garment image. (3) Parameter-efficient training. To handle the try-on task, we fine-tune FLUX.1-dev with only 39.7M additional parameters (0.33% of the backbone parameters). (4) Fewer inference steps. We apply distillation diffusion to MC-VTON and need only 8 steps to generate a realistic try-on image, with only 86.8M additional parameters (0.72% of the backbone parameters). Experiments show that MC-VTON achieves superior qualitative and quantitative results with fewer condition inputs, fewer inference steps, and fewer trainable parameters than baseline methods.
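As a quick consistency check on the reported parameter budgets, backing out the implied backbone size from each (parameters, percentage) pair gives roughly the same number, so both figures refer to the same backbone; no external parameter count is assumed here.

```python
# Implied backbone size from each pair of figures quoted in the abstract.
print(39.7e6 / 0.0033 / 1e9)   # ~12.0 -> implies a ~12B-parameter backbone
print(86.8e6 / 0.0072 / 1e9)   # ~12.1 -> consistent with the same backbone
```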
https://arxiv.org/abs/2501.03630
The gait, as a kind of soft biometric characteristic, can reflect the distinct walking patterns of individuals at a distance, making it a promising technique for unrestrained human identification. By largely excluding gait-unrelated cues hidden in RGB videos, the silhouette and skeleton, though visually compact, have long been two of the most prevalent gait modalities. Recently, several attempts have been made to introduce more informative data forms, such as human parsing and optical flow images, to capture gait characteristics, along with multi-branch architectures. However, due to the inconsistency within model designs and experiment settings, we argue that a comprehensive and fair comparative study among these popular gait modalities, covering representational capacity and fusion-strategy exploration, is still lacking. From the perspectives of fine- vs. coarse-grained shape and whole- vs. pixel-wise motion modeling, this work presents an in-depth investigation of three popular gait representations, i.e., silhouette, human parsing, and optical flow, with various fusion evaluations, and experimentally exposes their similarities and differences. Based on the obtained insights, we further develop a C$^2$Fusion strategy and consequently build our new framework, MultiGait++. C$^2$Fusion preserves commonalities while highlighting differences to enrich the learning of gait features. To verify our findings and conclusions, extensive experiments are conducted on Gait3D, GREW, CCPG, and SUSTech1K. The code is available at this https URL.
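The abstract says C$^2$Fusion "preserves commonalities while highlighting differences"; one way to read that literally is sketched below (elementwise mean for the shared part, absolute difference for the distinctive part), purely as an illustration rather than the paper's actual operator.

```python
import torch

def c2fusion_sketch(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Toy commonality/difference fusion of two modality features (B, C, H, W)."""
    common = 0.5 * (feat_a + feat_b)        # shared structure across modalities
    distinct = (feat_a - feat_b).abs()      # modality-specific cues kept explicit
    return torch.cat([common, distinct], dim=1)

sil_feat = torch.randn(4, 128, 16, 11)      # e.g. silhouette-branch features
par_feat = torch.randn(4, 128, 16, 11)      # e.g. parsing-branch features
print(c2fusion_sketch(sil_feat, par_feat).shape)  # torch.Size([4, 256, 16, 11])
```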
https://arxiv.org/abs/2412.11495
This paper studies a combined person re-identification (re-id) method that uses human parsing, analytical feature extraction, and similarity estimation schemes. One of its prominent features is its low computational requirements, so it can be implemented on edge devices. The method allows direct comparison of specific image regions using interpretable features consisting of color and texture channels. Colors are analyzed and compared in the CIE-Lab color space, with histogram smoothing for noise reduction. A novel pre-configured latent-space (LS) supervised autoencoder (SAE) is proposed for texture analysis, which encodes input textures as LS points. This yields more accurate similarity measures than simplistic label comparison. The proposed method also does not rely on photos or other re-id data for training, which makes it completely re-id dataset-agnostic. The viability of the proposed method is verified by computing rank-1, rank-10, and mAP re-id metrics on the Market1501 dataset. The results are comparable to those of conventional deep learning methods, and potential ways to further improve the method are discussed.
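The color part of the pipeline is straightforward to sketch: convert a body-part crop to CIE-Lab, build per-channel histograms, smooth them, and compare. Bin counts, the smoothing kernel, and the intersection metric below are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from skimage.color import rgb2lab

def lab_histogram(crop: np.ndarray, bins: int = 32, kernel: int = 5) -> np.ndarray:
    """Smoothed, concatenated L/a/b histograms of an RGB crop (values in [0, 1])."""
    lab = rgb2lab(crop)
    ranges = [(0, 100), (-128, 127), (-128, 127)]
    hists = []
    for ch, rng in enumerate(ranges):
        h, _ = np.histogram(lab[..., ch], bins=bins, range=rng, density=True)
        h = np.convolve(h, np.ones(kernel) / kernel, mode="same")   # noise reduction
        hists.append(h)
    return np.concatenate(hists)

def color_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Histogram intersection between two region descriptors."""
    return float(np.minimum(a, b).sum() / max(a.sum(), 1e-8))

crop_a = np.random.rand(64, 32, 3)
crop_b = np.random.rand(64, 32, 3)
print(color_similarity(lab_histogram(crop_a), lab_histogram(crop_b)))
```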
https://arxiv.org/abs/2412.05076
Existing studies of gait recognition primarily utilize sequences of either binary silhouettes or human parsing to encode the shapes and dynamics of persons during walking. Silhouettes exhibit accurate segmentation quality and robustness to environmental variations, but their low information entropy may result in sub-optimal performance. In contrast, human parsing provides fine-grained part segmentation with higher information entropy, but the segmentation quality may deteriorate in complex environments. To exploit the advantages of silhouette and parsing while overcoming their limitations, this paper proposes a novel cross-granularity alignment gait recognition method, named XGait, to unleash the power of gait representations of different granularity. To achieve this goal, XGait first contains two branches of backbone encoders that map the silhouette sequences and the parsing sequences into two latent spaces, respectively. Moreover, to explore the complementary knowledge across the features of the two representations, we design the Global Cross-granularity Module (GCM) and the Part Cross-granularity Module (PCM) after the two encoders. In particular, the GCM aims to enhance the quality of parsing features by leveraging global features from silhouettes, while the PCM aligns the dynamics of human parts between silhouette and parsing features using the high information entropy in parsing sequences. In addition, to effectively guide the alignment of the two representations with different granularity at the part level, an elaborately designed learnable division mechanism is proposed for the parsing features. Comprehensive experiments on two large-scale gait datasets not only show the superior performance of XGait, with Rank-1 accuracies of 80.5% on Gait3D and 88.3% on CCPG, but also reflect the robustness of the learned features even under challenging conditions such as occlusions and cloth changes.
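As a rough illustration of the Global Cross-granularity Module idea, the sketch below uses a global silhouette descriptor to re-weight parsing-feature channels; the squeeze-and-excitation form is an assumption, not XGait's published module.

```python
import torch
import torch.nn as nn

class GlobalCrossGranularity(nn.Module):
    """Use a global silhouette feature to gate parsing-feature channels."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, parsing_feat: torch.Tensor, sil_feat: torch.Tensor) -> torch.Tensor:
        # parsing_feat: (B, C, H, W) fine-grained but noisier features
        # sil_feat:     (B, C, H, W) coarser but robust silhouette features
        g = self.gate(sil_feat.mean(dim=(2, 3)))          # global silhouette descriptor
        return parsing_feat * g[..., None, None]          # quality-aware re-weighting

out = GlobalCrossGranularity()(torch.randn(2, 256, 16, 11), torch.randn(2, 256, 16, 11))
print(out.shape)  # torch.Size([2, 256, 16, 11])
```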
https://arxiv.org/abs/2411.10742
Virtual try-on methods based on diffusion models achieve realistic try-on effects but often replicate the backbone network as a ReferenceNet or use additional image encoders to process condition inputs, leading to high training and inference costs. In this work, we rethink the necessity of ReferenceNet and image encoders and innovate the interaction between garment and person by proposing CatVTON, a simple and efficient virtual try-on diffusion model. CatVTON facilitates the seamless transfer of in-shop or worn garments of any category to target persons by simply concatenating them in spatial dimensions as inputs. The efficiency of our model is demonstrated in three aspects: (1) Lightweight network: Only the original diffusion modules are used, without additional network modules. The text encoder and cross-attentions for text injection in the backbone are removed, reducing the parameters by 167.02M. (2) Parameter-efficient training: We identified the try-on relevant modules through experiments and achieved high-quality try-on effects by training only 49.57M parameters, approximately 5.51 percent of the backbone network's parameters. (3) Simplified inference: CatVTON eliminates all unnecessary conditions and preprocessing steps, including pose estimation, human parsing, and text input, requiring only a garment reference, target person image, and mask for the virtual try-on process. Extensive experiments demonstrate that CatVTON achieves superior qualitative and quantitative results with fewer prerequisites and trainable parameters than baseline methods. Furthermore, CatVTON shows good generalization in in-the-wild scenarios despite using open-source datasets with only 73K samples.
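The core input trick is spatial concatenation; here is a hedged sketch of how the person and garment latents might be tiled side by side before entering the diffusion backbone (the latent sizes and the concatenation axis are assumptions).

```python
import torch

# Hypothetical VAE latents at 1/8 resolution for a 1024x768 image pair.
person_latent  = torch.randn(1, 4, 128, 96)   # masked target-person latent
garment_latent = torch.randn(1, 4, 128, 96)   # in-shop or worn garment latent

# CatVTON-style conditioning: no ReferenceNet, no extra encoder -- the garment is
# simply placed next to the person along a spatial axis and denoised jointly.
diffusion_input = torch.cat([person_latent, garment_latent], dim=-1)  # (1, 4, 128, 192)
print(diffusion_input.shape)
```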
https://arxiv.org/abs/2407.15886
Cloth-changing person re-identification (CC-ReID) aims to retrieve specific pedestrians in a cloth-changing scenario. Its main challenge is to disentangle the clothing-related and clothing-unrelated features. Most existing approaches force the model to learn clothing-unrelated features by changing the color of the clothes. However, due to the lack of ground truth, these methods inevitably introduce noise, which destroys the discriminative features and leads to an uncontrollable disentanglement process. In this paper, we propose a new person re-identification network called features reconstruction disentanglement ReID (FRD-ReID), which can controllably decouple the clothing-unrelated and clothing-related features. Specifically, we first introduce the human parsing mask as the ground truth of the reconstruction process. At the same time, we propose the far away attention (FAA) mechanism and the person contour attention (PCA) mechanism for clothing-unrelated features and pedestrian contour features to improve the feature reconstruction efficiency. In the testing phase, we directly discard the clothing-related features for inference, which leads to a controllable disentanglement process. We conducted extensive experiments on the PRCC, LTCC, and Vc-Clothes datasets and demonstrated that our method outperforms existing state-of-the-art methods.
https://arxiv.org/abs/2407.10694
Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best method in the tasks of semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at \url{this https URL}.
https://arxiv.org/abs/2407.02228
The studies of human clothing for digital avatars have predominantly relied on synthetic datasets. While easy to collect, synthetic data often fall short in realism and fail to capture authentic clothing dynamics. Addressing this gap, we introduce 4D-DRESS, the first real-world 4D dataset advancing human clothing research with its high-quality 4D textured scans and garment meshes. 4D-DRESS captures 64 outfits in 520 human motion sequences, amounting to 78k textured scans. Creating a real-world clothing dataset is challenging, particularly in annotating and segmenting the extensive and complex 4D human scans. To address this, we develop a semi-automatic 4D human parsing pipeline. We efficiently combine a human-in-the-loop process with automation to accurately label 4D scans in diverse garments and body movements. Leveraging precise annotations and high-quality garment meshes, we establish several benchmarks for clothing simulation and reconstruction. 4D-DRESS offers realistic and challenging data that complements synthetic sources, paving the way for advancements in research of lifelike human clothing. Website: this https URL.
https://arxiv.org/abs/2404.18630
The goal of occluded person re-identification (ReID) is to retrieve specific pedestrians in occluded situations. However, occluded person ReID still suffers from background clutter and low-quality local feature representations, which limit model performance. In our research, we introduce a new framework called PAB-ReID, a novel ReID model incorporating part-attention mechanisms to tackle the aforementioned issues effectively. Firstly, we introduce human parsing labels to guide the generation of more accurate human part attention maps. In addition, we propose a fine-grained feature focuser for generating fine-grained human local feature representations while suppressing background interference. Moreover, we also design a part triplet loss to supervise the learning of human local features, which optimizes intra-/inter-class distances. We conducted extensive experiments on specialized occlusion and regular ReID datasets, showing that our approach outperforms existing state-of-the-art methods.
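A sketch of a part-level triplet loss of the kind described, applied independently to each body-part feature and averaged; the margin value and the averaging over parts are assumptions.

```python
import torch
import torch.nn.functional as F

def part_triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                      negative: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """Triplet loss per body part, averaged over parts.

    Each tensor is (B, P, D): batch, number of human parts, feature dim.
    """
    losses = [F.triplet_margin_loss(anchor[:, p], positive[:, p], negative[:, p],
                                    margin=margin)
              for p in range(anchor.size(1))]
    return torch.stack(losses).mean()

loss = part_triplet_loss(torch.randn(8, 4, 256), torch.randn(8, 4, 256),
                         torch.randn(8, 4, 256))
```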
https://arxiv.org/abs/2404.03443
This survey presents a comprehensive analysis of data augmentation techniques in human-centric vision tasks, a first of its kind in the field. It delves into a wide range of research areas including person ReID, human parsing, human pose estimation, and pedestrian detection, addressing the significant challenges posed by overfitting and limited training data in these domains. Our work categorizes data augmentation methods into two main types: data generation and data perturbation. Data generation covers techniques like graphic engine-based generation, generative model-based generation, and data recombination, while data perturbation is divided into image-level and human-level perturbations. Each method is tailored to the unique requirements of human-centric tasks, with some applicable across multiple areas. Our contributions include an extensive literature review, providing deep insights into the influence of these augmentation techniques in human-centric vision and highlighting the nuances of each method. We also discuss open issues and future directions, such as the integration of advanced generative models like Latent Diffusion Models, for creating more realistic and diverse training data. This survey not only encapsulates the current state of data augmentation in human-centric vision but also charts a course for future research, aiming to develop more robust, accurate, and efficient human-centric vision systems.
https://arxiv.org/abs/2403.08650
Recent unsupervised person re-identification (re-ID) methods achieve high performance by leveraging fine-grained local context. These methods are referred to as part-based methods. However, most part-based methods obtain local contexts through horizontal division, which suffer from misalignment due to various human poses. Additionally, the misalignment of semantic information in part features restricts the use of metric learning, thus affecting the effectiveness of part-based methods. The two issues mentioned above result in the under-utilization of part features in part-based methods. We introduce the Spatial Cascaded Clustering and Weighted Memory (SCWM) method to address these challenges. SCWM aims to parse and align more accurate local contexts for different human body parts while allowing the memory module to balance hard example mining and noise suppression. Specifically, we first analyze the foreground omissions and spatial confusions issues in the previous method. Then, we propose foreground and space corrections to enhance the completeness and reasonableness of the human parsing results. Next, we introduce a weighted memory and utilize two weighting strategies. These strategies address hard sample mining for global features and enhance noise resistance for part features, which enables better utilization of both global and part features. Extensive experiments on Market-1501 and MSMT17 validate the proposed method's effectiveness over many state-of-the-art methods.
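The weighted memory can be sketched as a momentum update of cluster centroids in which each sample's contribution is scaled by a weight (here derived from its similarity to the current centroid); the weighting rule and momentum value are assumptions, not the paper's two published strategies.

```python
import torch
import torch.nn.functional as F

def weighted_memory_update(memory: torch.Tensor, feats: torch.Tensor,
                           labels: torch.Tensor, momentum: float = 0.2) -> torch.Tensor:
    """Update per-cluster centroids with similarity-derived sample weights."""
    memory = F.normalize(memory, dim=1).clone()
    feats = F.normalize(feats, dim=1)
    for k in labels.unique():
        f = feats[labels == k]
        w = torch.softmax(f @ memory[k], dim=0)          # weight easy/hard samples differently
        update = (w[:, None] * f).sum(dim=0)
        memory[k] = F.normalize((1 - momentum) * memory[k] + momentum * update, dim=0)
    return memory

mem = weighted_memory_update(torch.randn(10, 128), torch.randn(32, 128),
                             torch.randint(0, 10, (32,)))
```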
https://arxiv.org/abs/2403.00261
The paper introduces the Decouple Re-identificatiOn and human Parsing (DROP) method for occluded person re-identification (ReID). Unlike mainstream approaches using global features for simultaneous multi-task learning of ReID and human parsing, or relying on semantic information for attention guidance, DROP argues that the inferior performance of the former is due to distinct granularity requirements for ReID and human parsing features. ReID focuses on instance part-level differences between pedestrian parts, while human parsing centers on semantic spatial context, reflecting the internal structure of the human body. To address this, DROP decouples features for ReID and human parsing, proposing detail-preserving upsampling to combine varying resolution feature maps. Parsing-specific features for human parsing are decoupled, and human position information is exclusively added to the human parsing branch. In the ReID branch, a part-aware compactness loss is introduced to enhance instance-level part differences. Experimental results highlight the efficacy of DROP, especially achieving a Rank-1 accuracy of 76.8% on Occluded-Duke, surpassing two mainstream methods. The codebase is accessible at this https URL.
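The abstract mentions detail-preserving upsampling that combines feature maps of varying resolution; one generic reading, upsampling the coarse map and fusing it with the high-resolution one, is sketched below as an assumption rather than DROP's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailPreservingUpsample(nn.Module):
    """Upsample coarse semantic features and fuse them with a high-res map."""
    def __init__(self, coarse_dim: int = 512, fine_dim: int = 256, out_dim: int = 256):
        super().__init__()
        self.fuse = nn.Conv2d(coarse_dim + fine_dim, out_dim, 3, padding=1)

    def forward(self, coarse: torch.Tensor, fine: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(coarse, size=fine.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([up, fine], dim=1))

out = DetailPreservingUpsample()(torch.randn(2, 512, 16, 8), torch.randn(2, 256, 64, 32))
print(out.shape)  # torch.Size([2, 256, 64, 32])
```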
https://arxiv.org/abs/2401.18032
Multimodal-based action recognition methods have achieved high success using the pose and RGB modalities. However, skeleton sequences lack appearance depiction, and RGB images suffer from irrelevant noise due to modality limitations. To address this, we introduce the human parsing feature map as a novel modality, since it can selectively retain effective semantic features of the body parts while filtering out most irrelevant noise. We propose a new dual-branch framework called Ensemble Human Parsing and Pose Network (EPP-Net), which is the first to leverage both skeleton and human parsing modalities for action recognition. The human pose branch feeds robust skeletons into a graph convolutional network to model pose features, while the human parsing branch leverages depictive parsing feature maps to model parsing features via convolutional backbones. The two high-level features are effectively combined through a late fusion strategy for better action recognition. Extensive experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of our proposed EPP-Net, which outperforms existing action recognition methods. Our code is available at: this https URL.
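The late-fusion step itself is simple to illustrate: class scores from the pose (GCN) branch and the parsing (CNN) branch are combined with a weighted sum over the same action classes; the weight value below is an assumption.

```python
import torch

def late_fusion(pose_logits: torch.Tensor, parsing_logits: torch.Tensor,
                alpha: float = 0.6) -> torch.Tensor:
    """Ensemble two branch predictions over the same action classes."""
    return alpha * pose_logits + (1.0 - alpha) * parsing_logits

fused = late_fusion(torch.randn(16, 60), torch.randn(16, 60))  # e.g. 60 NTU RGB+D classes
pred = fused.argmax(dim=-1)
```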
https://arxiv.org/abs/2401.02138
Occluded person re-identification (ReID) is a very challenging task due to occlusion disturbance and incomplete target information. Leveraging external cues such as human pose or parsing to locate and align part features has proven very effective for occluded person ReID. Meanwhile, recent Transformer structures have a strong ability for long-range modeling. Considering the above facts, we propose a Teacher-Student Decoder (TSD) framework for occluded person ReID, which utilizes the Transformer decoder with the help of human parsing. More specifically, our proposed TSD consists of a Parsing-aware Teacher Decoder (PTD) and a Standard Student Decoder (SSD). PTD employs human parsing cues to restrict the Transformer's attention and imparts this information to SSD through feature distillation. Thereby, SSD can learn from PTD to aggregate information from body parts automatically. Moreover, a mask generator is designed to provide discriminative regions for better ReID. In addition, existing occluded person ReID benchmarks use occluded samples as queries, which amplifies the role of alleviating occlusion interference and underestimates the impact of the feature-absence issue. In contrast, we propose a new benchmark with non-occluded queries, serving as a complement to the existing benchmark. Extensive experiments demonstrate that our proposed method is superior and the new benchmark is essential. The source codes are available at this https URL.
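Two ingredients of TSD are easy to sketch in isolation: restricting attention with a parsing-derived mask, and distilling the teacher's decoded features into the student. The additive-mask formulation and the MSE distillation loss are assumptions, not the paper's exact objectives.

```python
import torch
import torch.nn.functional as F

def parsing_masked_attention(q, k, v, parsing_mask):
    """Scaled dot-product attention restricted to parsed body regions.

    q, k, v:       (B, N, D) token features
    parsing_mask:  (B, N) bool, True where a token falls on a human part
    """
    scores = q @ k.transpose(1, 2) / q.size(-1) ** 0.5          # (B, N, N)
    scores = scores.masked_fill(~parsing_mask[:, None, :], float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def distillation_loss(teacher_feat: torch.Tensor, student_feat: torch.Tensor) -> torch.Tensor:
    """Feature distillation from the parsing-aware teacher to the student decoder."""
    return F.mse_loss(student_feat, teacher_feat.detach())

q = k = v = torch.randn(2, 49, 256)
mask = torch.rand(2, 49) > 0.3            # placeholder for a parsing-derived token mask
teacher_out = parsing_masked_attention(q, k, v, mask)
loss = distillation_loss(teacher_out, torch.randn(2, 49, 256))
```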
https://arxiv.org/abs/2312.09797