Virtual try-on methods based on diffusion models achieve realistic try-on effects, but they use an extra reference network or an additional image encoder to process multiple conditional image inputs, which results in high training costs. Besides, they require more than 25 inference steps, leading to long inference times. In this work, motivated by the development of the diffusion transformer (DiT), we rethink the necessity of the reference network or image encoder and propose MC-VTON, which enables DiT to integrate minimal conditional try-on inputs using its intrinsic backbone. Compared to existing methods, the superiority of MC-VTON is demonstrated in four aspects: (1) Superior detail fidelity. Our DiT-based MC-VTON exhibits superior fidelity in preserving fine-grained details. (2) Simplified network and inputs. We remove any extra reference network or image encoder, as well as unnecessary conditions such as the long prompt, pose estimation, human parsing, and depth map; only the masked person image and the garment image are required. (3) Parameter-efficient training. To handle the try-on task, we fine-tune FLUX.1-dev with only 39.7M additional parameters (0.33% of the backbone parameters). (4) Fewer inference steps. We apply diffusion distillation to MC-VTON and need only 8 steps to generate a realistic try-on image, with only 86.8M additional parameters (0.72% of the backbone parameters). Experiments show that MC-VTON achieves superior qualitative and quantitative results with fewer condition inputs, fewer inference steps, and fewer trainable parameters than baseline methods.
https://arxiv.org/abs/2501.03630
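Below is a minimal, hypothetical sketch (not the MC-VTON release) of the core idea in the abstract above: instead of a reference network or image encoder, garment tokens are simply concatenated with the masked-person tokens and processed by the DiT backbone's own attention. The tiny block, the learned condition embedding, and all shapes are illustrative assumptions.

```python
# Toy illustration of conditioning a DiT-style backbone by concatenating
# garment tokens with masked-person tokens along the sequence dimension,
# so no reference network or image encoder is needed.
import torch
import torch.nn as nn

class TinyDiTBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class ConditionalTryOnDiT(nn.Module):
    def __init__(self, dim: int = 256, depth: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(TinyDiTBlock(dim) for _ in range(depth))
        # Small trainable embedding marking which tokens are the garment
        # condition (loosely analogous to fine-tuning only a few parameters).
        self.cond_embed = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, person_tokens, garment_tokens):
        garment_tokens = garment_tokens + self.cond_embed
        x = torch.cat([person_tokens, garment_tokens], dim=1)  # join along tokens
        for blk in self.blocks:
            x = blk(x)
        # Only the person part of the sequence would be decoded back to an image.
        return x[:, : person_tokens.shape[1]]

person = torch.randn(2, 196, 256)   # tokens of the masked person latent
garment = torch.randn(2, 196, 256)  # tokens of the garment latent
print(ConditionalTryOnDiT()(person, garment).shape)  # torch.Size([2, 196, 256])
```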
The gait, as a kind of soft biometric characteristic, can reflect the distinct walking patterns of individuals at a distance, making it a promising cue for unrestrained human identification. By largely excluding gait-unrelated cues hidden in RGB videos, the silhouette and skeleton, though visually compact, have long been two of the most prevailing gait modalities. Recently, several attempts have been made to introduce more informative data forms, such as human parsing and optical flow images, to capture gait characteristics, along with multi-branch architectures. However, due to the inconsistency within model designs and experiment settings, we argue that a comprehensive and fair comparative study among these popular gait modalities, covering representational capacity and fusion strategy exploration, is still lacking. From the perspectives of fine- vs. coarse-grained shape and whole vs. pixel-wise motion modeling, this work presents an in-depth investigation of three popular gait representations, i.e., silhouette, human parsing, and optical flow, with various fusion evaluations, and experimentally exposes their similarities and differences. Based on the obtained insights, we further develop a C$^2$Fusion strategy, consequently building our new framework MultiGait++. C$^2$Fusion preserves commonalities while highlighting differences to enrich the learning of gait features. To verify our findings and conclusions, extensive experiments on Gait3D, GREW, CCPG, and SUSTech1K are conducted. The code is available at this https URL.
https://arxiv.org/abs/2412.11495
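One plausible reading of "preserves commonalities while highlighting differences" is sketched below; the actual C$^2$Fusion design in MultiGait++ may differ. Here the shared component is taken as the element-wise mean of the modality features, the differences as the per-modality residuals, and both are recombined with a 1x1 convolution.

```python
# Hedged sketch of a commonality-plus-difference fusion across the
# silhouette, parsing, and optical-flow branches.
import torch
import torch.nn as nn

class CommonDiffFusion(nn.Module):
    def __init__(self, channels: int, num_modalities: int = 3):
        super().__init__()
        self.fuse = nn.Conv2d(channels * (num_modalities + 1), channels, kernel_size=1)

    def forward(self, feats):  # feats: list of (B, C, H, W), one per modality
        common = torch.stack(feats, dim=0).mean(dim=0)      # preserved commonality
        diffs = [f - common for f in feats]                  # highlighted differences
        return self.fuse(torch.cat([common, *diffs], dim=1))

silhouette = torch.randn(4, 64, 32, 22)
parsing = torch.randn(4, 64, 32, 22)
flow = torch.randn(4, 64, 32, 22)
fused = CommonDiffFusion(64)([silhouette, parsing, flow])
print(fused.shape)  # torch.Size([4, 64, 32, 22])
```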
This paper studies a combined person re-identification (re-id) method that uses human parsing, analytical feature extraction, and similarity estimation schemes. One of its prominent features is its low computational requirements, so it can be implemented on edge devices. The method allows direct comparison of specific image regions using interpretable features consisting of color and texture channels. Colors are analyzed and compared in the CIE-Lab color space, with histogram smoothing used for noise reduction. A novel pre-configured latent space (LS) supervised autoencoder (SAE) is proposed for texture analysis, encoding input textures as LS points. This allows more accurate similarity measures to be obtained compared to simplistic label comparison. The proposed method also does not rely on photos or other re-id data for training, which makes it completely re-id dataset-agnostic. The viability of the proposed method is verified by computing rank-1, rank-10, and mAP re-id metrics on the Market1501 dataset. The results are comparable to those of conventional deep learning methods, and potential ways to further improve the method are discussed.
https://arxiv.org/abs/2412.05076
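A hedged sketch of the color part of the pipeline described above: smoothed CIE-Lab histograms for an image region, compared by histogram intersection. Bin counts, the smoothing sigma, and the similarity measure are assumptions, not values from the paper.

```python
# Smoothed CIE-Lab color histograms for a body-part crop, compared by
# histogram intersection (1.0 means identical distributions).
import numpy as np
from scipy.ndimage import gaussian_filter1d
from skimage.color import rgb2lab

def lab_histogram(rgb_region: np.ndarray, bins: int = 32, sigma: float = 1.0) -> np.ndarray:
    """rgb_region: (H, W, 3) uint8 crop of one image region."""
    lab = rgb2lab(rgb_region.astype(np.float64) / 255.0)  # L in [0,100], a/b roughly [-128,127]
    ranges = [(0, 100), (-128, 127), (-128, 127)]
    hists = []
    for ch, (lo, hi) in enumerate(ranges):
        h, _ = np.histogram(lab[..., ch], bins=bins, range=(lo, hi), density=True)
        hists.append(gaussian_filter1d(h, sigma))          # smoothing reduces noise sensitivity
    h = np.concatenate(hists)
    return h / (h.sum() + 1e-8)

def intersection(h1: np.ndarray, h2: np.ndarray) -> float:
    return float(np.minimum(h1, h2).sum())

a = (np.random.rand(64, 32, 3) * 255).astype(np.uint8)
b = (np.random.rand(64, 32, 3) * 255).astype(np.uint8)
print(intersection(lab_histogram(a), lab_histogram(b)))
```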
Existing studies for gait recognition primarily utilize sequences of either binary silhouettes or human parsing to encode the shapes and dynamics of persons during walking. Silhouettes exhibit accurate segmentation quality and robustness to environmental variations, but their low information entropy may result in sub-optimal performance. In contrast, human parsing provides fine-grained part segmentation with higher information entropy, but the segmentation quality may deteriorate due to complex environments. To leverage the advantages of silhouette and parsing and overcome their limitations, this paper proposes a novel cross-granularity alignment gait recognition method, named XGait, to unleash the power of gait representations of different granularity. To achieve this goal, XGait first contains two branches of backbone encoders that map the silhouette sequences and the parsing sequences into two latent spaces, respectively. Moreover, to explore the complementary knowledge across the features of the two representations, we design the Global Cross-granularity Module (GCM) and the Part Cross-granularity Module (PCM) after the two encoders. In particular, the GCM aims to enhance the quality of parsing features by leveraging global features from silhouettes, while the PCM aligns the dynamics of human parts between silhouette and parsing features using the high information entropy in parsing sequences. In addition, to effectively guide the alignment of the two representations with different granularity at the part level, an elaborately designed learnable division mechanism is proposed for the parsing features. Comprehensive experiments on two large-scale gait datasets not only show the superior performance of XGait, with Rank-1 accuracy of 80.5% on Gait3D and 88.3% on CCPG, but also reflect the robustness of the learned features even under challenging conditions such as occlusions and cloth changes.
https://arxiv.org/abs/2411.10742
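The following is an illustrative, simplified stand-in for a global cross-granularity interaction (the actual GCM in XGait is more elaborate): a global silhouette descriptor produces a channel-wise gate that re-weights the parsing features, letting the cleaner silhouette stream correct noisy parsing channels.

```python
# Simplified global cross-granularity gating: silhouette features guide
# a channel-wise enhancement of the parsing features.
import torch
import torch.nn as nn

class GlobalCrossGranularity(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),            # global silhouette descriptor
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, parsing_feat, silhouette_feat):
        g = self.gate(silhouette_feat)          # (B, C, 1, 1) channel gate
        return parsing_feat + g * parsing_feat  # enhanced parsing features

pars = torch.randn(2, 128, 16, 11)
sil = torch.randn(2, 128, 16, 11)
print(GlobalCrossGranularity(128)(pars, sil).shape)  # torch.Size([2, 128, 16, 11])
```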
Virtual try-on methods based on diffusion models achieve realistic try-on effects but often replicate the backbone network as a ReferenceNet or use additional image encoders to process condition inputs, leading to high training and inference costs. In this work, we rethink the necessity of ReferenceNet and image encoders and innovate the interaction between garment and person by proposing CatVTON, a simple and efficient virtual try-on diffusion model. CatVTON facilitates the seamless transfer of in-shop or worn garments of any category to target persons by simply concatenating them in spatial dimensions as inputs. The efficiency of our model is demonstrated in three aspects: (1) Lightweight network: Only the original diffusion modules are used, without additional network modules. The text encoder and cross-attentions for text injection in the backbone are removed, reducing the parameters by 167.02M. (2) Parameter-efficient training: We identified the try-on relevant modules through experiments and achieved high-quality try-on effects by training only 49.57M parameters, approximately 5.51 percent of the backbone network's parameters. (3) Simplified inference: CatVTON eliminates all unnecessary conditions and preprocessing steps, including pose estimation, human parsing, and text input, requiring only a garment reference, target person image, and mask for the virtual try-on process. Extensive experiments demonstrate that CatVTON achieves superior qualitative and quantitative results with fewer prerequisites and trainable parameters than baseline methods. Furthermore, CatVTON shows good generalization in in-the-wild scenarios despite using open-source datasets with only 73K samples.
https://arxiv.org/abs/2407.15886
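A minimal sketch of the input construction described above, under assumed shapes and channel layout (the released CatVTON code may arrange these differently): the masked person latent and the garment latent are concatenated side by side along a spatial dimension, and the inpainting mask is appended along channels.

```python
# Spatial concatenation of person and garment latents as the try-on input.
import torch

def build_tryon_input(person_latent, garment_latent, mask):
    # person_latent, garment_latent: (B, 4, H, W) VAE latents
    # mask: (B, 1, H, W), 1 where clothing should be generated
    masked_person = person_latent * (1.0 - mask)
    x = torch.cat([masked_person, garment_latent], dim=3)   # side by side in width
    m = torch.cat([mask, torch.zeros_like(mask)], dim=3)    # garment side stays unmasked
    return torch.cat([x, m], dim=1)                         # (B, 5, H, 2W)

person = torch.randn(1, 4, 64, 48)
garment = torch.randn(1, 4, 64, 48)
mask = torch.zeros(1, 1, 64, 48)
mask[..., 16:56, 8:40] = 1.0                                # region to re-dress
print(build_tryon_input(person, garment, mask).shape)       # torch.Size([1, 5, 64, 96])
```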
Cloth-changing person re-identification (CC-ReID) aims to retrieve specific pedestrians in a cloth-changing scenario. Its main challenge is to disentangle the clothing-related and clothing-unrelated features. Most existing approaches force the model to learn clothing-unrelated features by changing the color of the clothes. However, due to the lack of ground truth, these methods inevitably introduce noise, which destroys the discriminative features and leads to an uncontrollable disentanglement process. In this paper, we propose a new person re-identification network called features reconstruction disentanglement ReID (FRD-ReID), which can controllably decouple the clothing-unrelated and clothing-related features. Specifically, we first introduce the human parsing mask as the ground truth of the reconstruction process. At the same time, we propose the far away attention (FAA) mechanism and the person contour attention (PCA) mechanism for clothing-unrelated features and pedestrian contour features to improve the feature reconstruction efficiency. In the testing phase, we directly discard the clothing-related features for inference, which leads to a controllable disentanglement process. We conducted extensive experiments on the PRCC, LTCC, and Vc-Clothes datasets and demonstrated that our method outperforms existing state-of-the-art methods.
https://arxiv.org/abs/2407.10694
Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best method in the tasks of semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at \url{this https URL}.
https://arxiv.org/abs/2407.02228
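A structural sketch only: the real STM/CTM blocks are built on Mamba state-space layers, which are replaced here by a depthwise 1-D convolution placeholder so the example stays self-contained. What the sketch shows is the data flow the abstract describes, with STM mixing tokens within one task and CTM exchanging information across tasks.

```python
# STM/CTM-style data flow with a placeholder token mixer standing in for Mamba.
import torch
import torch.nn as nn

class PlaceholderMixer(nn.Module):              # stand-in for a Mamba (STM) layer
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, N, C) task tokens
        h = self.conv(self.norm(x).transpose(1, 2)).transpose(1, 2)
        return x + h

class CrossTaskBlock(nn.Module):                # CTM-like cross-task exchange
    def __init__(self, dim, num_tasks):
        super().__init__()
        self.gate = nn.Linear(dim * num_tasks, num_tasks)

    def forward(self, task_feats):              # list of (B, N, C), one per task
        shared = torch.cat(task_feats, dim=-1)
        w = torch.softmax(self.gate(shared), dim=-1)           # (B, N, T) mixing weights
        mixed = sum(w[..., i:i + 1] * f for i, f in enumerate(task_feats))
        return [f + mixed for f in task_feats]  # each task keeps its own stream

feats = [torch.randn(2, 100, 64) for _ in range(3)]            # e.g. 3 dense tasks
stm = PlaceholderMixer(64)
ctm = CrossTaskBlock(64, num_tasks=3)
out = ctm([stm(f) for f in feats])
print(len(out), out[0].shape)                                  # 3 torch.Size([2, 100, 64])
```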
The studies of human clothing for digital avatars have predominantly relied on synthetic datasets. While easy to collect, synthetic data often fall short in realism and fail to capture authentic clothing dynamics. Addressing this gap, we introduce 4D-DRESS, the first real-world 4D dataset advancing human clothing research with its high-quality 4D textured scans and garment meshes. 4D-DRESS captures 64 outfits in 520 human motion sequences, amounting to 78k textured scans. Creating a real-world clothing dataset is challenging, particularly in annotating and segmenting the extensive and complex 4D human scans. To address this, we develop a semi-automatic 4D human parsing pipeline. We efficiently combine a human-in-the-loop process with automation to accurately label 4D scans in diverse garments and body movements. Leveraging precise annotations and high-quality garment meshes, we establish several benchmarks for clothing simulation and reconstruction. 4D-DRESS offers realistic and challenging data that complements synthetic sources, paving the way for advancements in research of lifelike human clothing. Website: this https URL.
https://arxiv.org/abs/2404.18630
The goal of occluded person re-identification (ReID) is to retrieve specific pedestrians in occluded situations. However, occluded person ReID still suffers from background clutter and low-quality local feature representations, which limits model performance. In our research, we introduce a new framework called PAB-ReID, a novel ReID model incorporating part-attention mechanisms to tackle the aforementioned issues effectively. Firstly, we introduce the human parsing label to guide the generation of more accurate human part attention maps. In addition, we propose a fine-grained feature focuser for generating fine-grained human local feature representations while suppressing background interference. Moreover, we also design a part triplet loss to supervise the learning of human local features, which optimizes intra-/inter-class distances. We conducted extensive experiments on specialized occlusion and regular ReID datasets, showing that our approach outperforms existing state-of-the-art methods.
https://arxiv.org/abs/2404.03443
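A hedged sketch of a part-level triplet objective in the spirit of the abstract (the official loss may be defined differently): a triplet margin loss is computed independently for each body-part embedding and averaged over parts.

```python
# Part triplet loss: one triplet margin term per body part, averaged.
import torch
import torch.nn.functional as F

def part_triplet_loss(anchor, positive, negative, margin: float = 0.3):
    # each tensor: (B, P, D) -- B samples, P body parts, D-dim part features
    losses = [
        F.triplet_margin_loss(anchor[:, p], positive[:, p], negative[:, p], margin=margin)
        for p in range(anchor.shape[1])
    ]
    return torch.stack(losses).mean()

a, p, n = (torch.randn(8, 6, 256) for _ in range(3))
print(part_triplet_loss(a, p, n))
```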
This survey presents a comprehensive analysis of data augmentation techniques in human-centric vision tasks, a first of its kind in the field. It delves into a wide range of research areas including person ReID, human parsing, human pose estimation, and pedestrian detection, addressing the significant challenges posed by overfitting and limited training data in these domains. Our work categorizes data augmentation methods into two main types: data generation and data perturbation. Data generation covers techniques like graphic engine-based generation, generative model-based generation, and data recombination, while data perturbation is divided into image-level and human-level perturbations. Each method is tailored to the unique requirements of human-centric tasks, with some applicable across multiple areas. Our contributions include an extensive literature review, providing deep insights into the influence of these augmentation techniques in human-centric vision and highlighting the nuances of each method. We also discuss open issues and future directions, such as the integration of advanced generative models like Latent Diffusion Models, for creating more realistic and diverse training data. This survey not only encapsulates the current state of data augmentation in human-centric vision but also charts a course for future research, aiming to develop more robust, accurate, and efficient human-centric vision systems.
https://arxiv.org/abs/2403.08650
Recent unsupervised person re-identification (re-ID) methods achieve high performance by leveraging fine-grained local context. These methods are referred to as part-based methods. However, most part-based methods obtain local contexts through horizontal division, which suffer from misalignment due to various human poses. Additionally, the misalignment of semantic information in part features restricts the use of metric learning, thus affecting the effectiveness of part-based methods. The two issues mentioned above result in the under-utilization of part features in part-based methods. We introduce the Spatial Cascaded Clustering and Weighted Memory (SCWM) method to address these challenges. SCWM aims to parse and align more accurate local contexts for different human body parts while allowing the memory module to balance hard example mining and noise suppression. Specifically, we first analyze the foreground omissions and spatial confusions issues in the previous method. Then, we propose foreground and space corrections to enhance the completeness and reasonableness of the human parsing results. Next, we introduce a weighted memory and utilize two weighting strategies. These strategies address hard sample mining for global features and enhance noise resistance for part features, which enables better utilization of both global and part features. Extensive experiments on Market-1501 and MSMT17 validate the proposed method's effectiveness over many state-of-the-art methods.
https://arxiv.org/abs/2403.00261
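An illustrative weighted-memory sketch (the paper's two concrete weighting strategies differ): cluster centroids are updated with a momentum rule, and each sample's contribution is scaled by a per-sample weight, which could be large for hard global examples or small for noisy part features.

```python
# Momentum memory bank whose updates are modulated by per-sample weights.
import torch
import torch.nn.functional as F

class WeightedMemory:
    def __init__(self, num_clusters: int, dim: int, momentum: float = 0.9):
        self.bank = F.normalize(torch.randn(num_clusters, dim), dim=1)
        self.m = momentum

    @torch.no_grad()
    def update(self, feats, labels, weights):
        # feats: (B, D) L2-normalised features; labels: (B,); weights: (B,)
        for f, y, w in zip(feats, labels, weights):
            new = self.m * self.bank[y] + (1.0 - self.m) * w * f
            self.bank[y] = F.normalize(new, dim=0)

    def scores(self, feats, temperature: float = 0.05):
        return feats @ self.bank.t() / temperature  # logits for a contrastive loss

mem = WeightedMemory(num_clusters=10, dim=128)
f = F.normalize(torch.randn(4, 128), dim=1)
mem.update(f, torch.tensor([0, 1, 2, 3]), torch.tensor([1.0, 0.5, 0.8, 1.0]))
print(mem.scores(f).shape)  # torch.Size([4, 10])
```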
The paper introduces the Decouple Re-identificatiOn and human Parsing (DROP) method for occluded person re-identification (ReID). Unlike mainstream approaches using global features for simultaneous multi-task learning of ReID and human parsing, or relying on semantic information for attention guidance, DROP argues that the inferior performance of the former is due to distinct granularity requirements for ReID and human parsing features. ReID focuses on instance part-level differences between pedestrian parts, while human parsing centers on semantic spatial context, reflecting the internal structure of the human body. To address this, DROP decouples features for ReID and human parsing, proposing detail-preserving upsampling to combine varying resolution feature maps. Parsing-specific features for human parsing are decoupled, and human position information is exclusively added to the human parsing branch. In the ReID branch, a part-aware compactness loss is introduced to enhance instance-level part differences. Experimental results highlight the efficacy of DROP, especially achieving a Rank-1 accuracy of 76.8% on Occluded-Duke, surpassing two mainstream methods. The codebase is accessible at this https URL.
https://arxiv.org/abs/2401.18032
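A hedged sketch of a part-aware compactness objective (DROP's exact formulation may differ): part features belonging to the same identity are pulled toward their per-part centroid, encouraging instance-level part consistency.

```python
# Part-aware compactness: pull same-identity part features to their centroid.
import torch

def part_compactness_loss(part_feats, labels):
    # part_feats: (B, P, D); labels: (B,) identity labels
    loss, count = part_feats.new_zeros(()), 0
    for y in labels.unique():
        group = part_feats[labels == y]                 # (N_y, P, D)
        center = group.mean(dim=0, keepdim=True)        # per-part centroid
        loss = loss + ((group - center) ** 2).sum(dim=-1).mean()
        count += 1
    return loss / max(count, 1)

feats = torch.randn(8, 4, 256)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(part_compactness_loss(feats, labels))
```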
Multimodal action recognition methods have achieved great success using the pose and RGB modalities. However, skeleton sequences lack appearance depiction, and RGB images suffer from irrelevant noise due to modality limitations. To address this, we introduce the human parsing feature map as a novel modality, since it can selectively retain effective semantic features of the body parts while filtering out most irrelevant noise. We propose a new dual-branch framework called Ensemble Human Parsing and Pose Network (EPP-Net), which is the first to leverage both the skeleton and human parsing modalities for action recognition. The human pose branch feeds robust skeletons into a graph convolutional network to model pose features, while the human parsing branch leverages depictive parsing feature maps to model parsing features via convolutional backbones. The two high-level features are effectively combined through a late fusion strategy for better action recognition. Extensive experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of our proposed EPP-Net, which outperforms existing action recognition methods. Our code is available at: this https URL.
https://arxiv.org/abs/2401.02138
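A minimal sketch of the late-fusion step mentioned above: class scores from the pose (skeleton) branch and the parsing branch are combined by a weighted sum of their softmax outputs. The 0.6/0.4 weights are placeholders, not values from the paper.

```python
# Late fusion of the two branch predictions.
import torch

def late_fusion(pose_logits, parsing_logits, alpha: float = 0.6):
    pose_prob = torch.softmax(pose_logits, dim=1)
    parsing_prob = torch.softmax(parsing_logits, dim=1)
    return alpha * pose_prob + (1.0 - alpha) * parsing_prob

pose = torch.randn(2, 60)      # e.g. 60 NTU action classes
parsing = torch.randn(2, 60)
print(late_fusion(pose, parsing).argmax(dim=1))
```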
Occluded person re-identification (ReID) is a very challenging task due to occlusion disturbance and incomplete target information. Leveraging external cues such as human pose or parsing to locate and align part features has been proven very effective in occluded person ReID. Meanwhile, recent Transformer structures have a strong ability to model long-range dependencies. Considering the above facts, we propose a Teacher-Student Decoder (TSD) framework for occluded person ReID, which utilizes the Transformer decoder with the help of human parsing. More specifically, our proposed TSD consists of a Parsing-aware Teacher Decoder (PTD) and a Standard Student Decoder (SSD). PTD employs human parsing cues to restrict the Transformer's attention and imparts this information to SSD through feature distillation. Thereby, SSD can learn from PTD to aggregate information of body parts automatically. Moreover, a mask generator is designed to provide discriminative regions for better ReID. In addition, existing occluded person ReID benchmarks utilize occluded samples as queries, which amplifies the role of alleviating occlusion interference and underestimates the impact of the feature-absence issue. In contrast, we propose a new benchmark with non-occluded queries, serving as a complement to the existing benchmarks. Extensive experiments demonstrate that our proposed method is superior and that the new benchmark is essential. The source code is available at this https URL.
https://arxiv.org/abs/2312.09797
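A hedged sketch of the two ideas in the abstract, not the official TSD code: the teacher decoder restricts attention to parsed body tokens via a key-padding mask, and the student is trained to imitate the teacher's aggregated part features with a simple MSE distillation term. Query counts, dimensions, and the toy parsing mask are assumptions.

```python
# Parsing-restricted teacher attention distilled into a parsing-free student.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, heads = 256, 8
teacher_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
student_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

queries = torch.randn(2, 4, dim)       # learnable part queries (illustrative)
tokens = torch.randn(2, 96, dim)       # patch tokens of the person image
parsing_fg = torch.zeros(2, 96, dtype=torch.bool)
parsing_fg[:, :48] = True              # toy mask: first half of tokens lie on the body

# Teacher: attention is only allowed on foreground (parsed) tokens.
key_padding_mask = ~parsing_fg                               # True = ignore this key
teacher_out, _ = teacher_attn(queries, tokens, tokens,
                              key_padding_mask=key_padding_mask)

# Student: no parsing available at test time, plain attention.
student_out, _ = student_attn(queries, tokens, tokens)

distill_loss = F.mse_loss(student_out, teacher_out.detach())
print(distill_loss)
```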
We propose 360° Volumetric Portrait (3VP) Avatar, a novel method for reconstructing 360° photo-realistic portrait avatars of human subjects solely based on monocular video inputs. State-of-the-art monocular avatar reconstruction methods rely on stable facial performance capturing. However, the common usage of 3DMM-based facial tracking has its limits; side-views can hardly be captured and it fails, especially, for back-views, as required inputs like facial landmarks or human parsing masks are missing. This results in incomplete avatar reconstructions that only cover the frontal hemisphere. In contrast to this, we propose a template-based tracking of the torso, head and facial expressions which allows us to cover the appearance of a human subject from all sides. Thus, given a sequence of a subject that is rotating in front of a single camera, we train a neural volumetric representation based on neural radiance fields. A key challenge to construct this representation is the modeling of appearance changes, especially, in the mouth region (i.e., lips and teeth). We, therefore, propose a deformation-field-based blend basis which allows us to interpolate between different appearance states. We evaluate our approach on captured real-world data and compare against state-of-the-art monocular reconstruction methods. In contrast to those, our method is the first monocular technique that reconstructs an entire 360° avatar.
https://arxiv.org/abs/2312.05311
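A toy worked example of the blend-basis idea described above (shapes and the linear-blend form are assumptions): the per-point deformation is a weighted sum of learned basis fields, so interpolating the blend weights interpolates between appearance states such as a closed and an open mouth.

```python
# Deformation-field blend basis: interpolate between appearance states.
import torch

num_basis, num_points = 4, 1024
basis = torch.randn(num_basis, num_points, 3)      # learned deformation bases

def deform(points, weights):
    # points: (N, 3); weights: (K,) blend coefficients
    offset = torch.einsum("k,knd->nd", weights, basis)
    return points + offset

pts = torch.randn(num_points, 3)
w_closed = torch.tensor([1.0, 0.0, 0.0, 0.0])
w_open = torch.tensor([0.0, 1.0, 0.0, 0.0])
halfway = deform(pts, 0.5 * w_closed + 0.5 * w_open)   # interpolated mouth state
print(halfway.shape)  # torch.Size([1024, 3])
```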
In this paper, we propose a novel virtual try-on from unconstrained designs (ucVTON) task to enable photorealistic synthesis of personalized composite clothing on input human images. Unlike prior arts constrained by specific input types, our method allows flexible specification of style (text or image) and texture (full garment, cropped sections, or texture patches) conditions. To address the entanglement challenge when using full garment images as conditions, we develop a two-stage pipeline with explicit disentanglement of style and texture. In the first stage, we generate a human parsing map reflecting the desired style conditioned on the input. In the second stage, we composite textures onto the parsing map areas based on the texture input. To represent complex and non-stationary textures that have never been achieved in previous fashion editing works, we first propose extracting hierarchical and balanced CLIP features and applying position encoding in VTON. Experiments demonstrate superior synthesis quality and personalization enabled by our method. The flexible control over style and texture mixing brings virtual try-on to a new level of user experience for online shopping and fashion design.
https://arxiv.org/abs/2312.04534
Multi-human parsing is an image segmentation task necessitating both instance-level and fine-grained category-level information. However, prior research has typically processed these two types of information through separate branches and distinct output formats, leading to inefficient and redundant frameworks. This paper introduces UniParser, which integrates instance-level and category-level representations in three key aspects: 1) we propose a unified correlation representation learning approach, allowing our network to learn instance and category features within the cosine space; 2) we unify the output form of each module as pixel-level segmentation results while supervising instance and category features using a homogeneous label accompanied by an auxiliary loss; and 3) we design a joint optimization procedure to fuse instance and category representations. By virtue of unifying instance-level and category-level outputs, UniParser circumvents manually designed post-processing techniques and surpasses state-of-the-art methods, achieving 49.3% AP on MHPv2.0 and 60.4% AP on CIHP. We will release our source code, pretrained models, and online demos to facilitate future studies.
https://arxiv.org/abs/2310.08984
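A hedged sketch of learning correlations in cosine space (UniParser's actual heads are more involved): pixel embeddings and instance/category kernels are L2-normalized, so their inner products are cosine similarities that can be supervised directly as pixel-level segmentation logits.

```python
# Cosine-space correlation between pixel embeddings and class/instance kernels.
import torch
import torch.nn.functional as F

def cosine_segmentation_logits(pixel_feats, kernels, scale: float = 10.0):
    # pixel_feats: (B, C, H, W); kernels: (K, C) -- instance or category kernels
    p = F.normalize(pixel_feats, dim=1)
    k = F.normalize(kernels, dim=1)
    return scale * torch.einsum("bchw,kc->bkhw", p, k)   # (B, K, H, W) logits

feats = torch.randn(2, 64, 32, 32)
category_kernels = torch.randn(20, 64)      # e.g. 20 part categories
logits = cosine_segmentation_logits(feats, category_kernels)
print(logits.shape)  # torch.Size([2, 20, 32, 32])
```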
Human parsing aims to segment each pixel of the human image into fine-grained semantic categories. However, current human parsers trained with clean data are easily confused by numerous image corruptions such as blur and noise. To improve the robustness of human parsers, in this paper we construct three corruption robustness benchmarks, termed LIP-C, ATR-C, and Pascal-Person-Part-C, to assist us in evaluating the risk tolerance of human parsing models. Inspired by the data augmentation strategy, we propose a novel heterogeneous augmentation-enhanced mechanism to bolster robustness under commonly corrupted conditions. Specifically, two types of data augmentation from different views, i.e., image-aware augmentation and model-aware image-to-image transformation, are integrated in a sequential manner to adapt to unforeseen image corruptions. The image-aware augmentation enriches the diversity of training images with the help of common image operations, while the model-aware augmentation strategy improves the diversity of input data by exploiting the model's randomness. The proposed method is model-agnostic and can be plugged into arbitrary state-of-the-art human parsing frameworks. The experimental results show that the proposed method demonstrates good universality: it improves the robustness of human parsing models, and even semantic segmentation models, when facing various common image corruptions, while still obtaining comparable performance on clean data.
https://arxiv.org/abs/2309.00938
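An illustrative sketch of the sequential "image-aware then model-aware" idea; the paper's concrete operations differ, and the model-aware step here is approximated by averaging predictions under active dropout, which is an assumption. Common photometric and blur operations supply the image-aware part.

```python
# Sequential image-aware augmentation followed by a model-aware perturbation step.
import torch
import torch.nn as nn
from torchvision import transforms

image_aware = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
])

# Tiny stand-in segmentation model with dropout as the source of randomness.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.Dropout2d(0.3), nn.Conv2d(8, 5, 1))

def model_aware_forward(model, x, samples: int = 4):
    model.train()                      # keep dropout active: stochastic "views"
    with torch.no_grad():
        return torch.stack([model(x) for _ in range(samples)]).mean(dim=0)

img = torch.rand(3, 64, 64)            # image tensor in [0, 1]
aug = image_aware(img)                 # image-aware augmentation
pred = model_aware_forward(model, aug.unsqueeze(0))  # model-aware averaging
print(aug.shape, pred.shape)
```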
Binary silhouettes and keypoint-based skeletons have dominated human gait recognition studies for decades since they are easy to extract from video frames. Despite their success in gait recognition for in-the-lab environments, they usually fail in real-world scenarios due to their low information entropy for gait representations. To achieve accurate gait recognition in the wild, this paper presents a novel gait representation, named Gait Parsing Sequence (GPS). GPSs are sequences of fine-grained human segmentation, i.e., human parsing, extracted from video frames, so they have much higher information entropy to encode the shapes and dynamics of fine-grained human parts during walking. Moreover, to effectively explore the capability of the GPS representation, we propose a novel human parsing-based gait recognition framework, named ParsingGait. ParsingGait contains a Convolutional Neural Network (CNN)-based backbone and two lightweight heads. The first head extracts global semantic features from GPSs, while the other learns mutual information of part-level features through Graph Convolutional Networks to model the detailed dynamics of human walking. Furthermore, due to the lack of suitable datasets, we build the first parsing-based dataset for gait recognition in the wild, named Gait3D-Parsing, by extending the large-scale and challenging Gait3D dataset. Based on Gait3D-Parsing, we comprehensively evaluate our method and existing gait recognition methods. The experimental results show a significant improvement in accuracy brought by the GPS representation and the superiority of ParsingGait. The code and dataset are available at this https URL.
https://arxiv.org/abs/2308.16739
The fashion e-commerce industry has witnessed significant growth in recent years, prompting the exploration of image-based virtual try-on techniques to incorporate Augmented Reality (AR) experiences into online shopping platforms. However, existing research has largely overlooked a crucial aspect: the runtime of the underlying machine-learning model. While existing methods prioritize enhancing output quality, they often disregard the execution time, which restricts their applicability to a limited range of devices. To address this gap, we propose Distilled Mobile Real-time Virtual Try-On (DM-VTON), a novel virtual try-on framework designed for simplicity and efficiency. Our approach is based on a knowledge distillation scheme that leverages a strong Teacher network as supervision to guide a Student network without relying on human parsing. Notably, we introduce an efficient Mobile Generative Module within the Student network, significantly reducing the runtime while ensuring high-quality output. Additionally, we propose Virtual Try-on-guided Pose for Data Synthesis to address the limited pose variation observed in training images. Experimental results show that the proposed method can achieve 40 frames per second on a single Nvidia Tesla T4 GPU and only takes up 37 MB of memory while producing almost the same output quality as other state-of-the-art methods. DM-VTON stands poised to facilitate the advancement of real-time AR applications, in addition to the generation of lifelike attired human figures tailored for diverse specialized training tasks. this https URL
https://arxiv.org/abs/2308.13798