Gait recognition is a significant biometric technique for person identification, particularly in scenarios where other physiological biometrics are impractical or ineffective. In this paper, we address the challenges associated with gait recognition and present a novel approach to improve its accuracy and reliability. The proposed method leverages advanced techniques, including sequential gait landmarks obtained through the MediaPipe pose estimation model, Procrustes analysis for alignment, and a Siamese biGRU-dualStack neural network architecture for capturing temporal dependencies. Extensive experiments on large-scale cross-view datasets demonstrate the effectiveness of the approach, which achieves high recognition accuracy compared with other models: 95.7%, 94.44%, 87.71%, and 86.6% on the CASIA-B, SZU RGB-D, OU-MVLP, and Gait3D datasets, respectively. The results highlight the potential applications of the proposed method in various practical domains, indicating its significant contribution to the field of gait recognition.
https://arxiv.org/abs/2412.03498
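As a rough illustration of the alignment step named above, the sketch below applies classical Procrustes analysis to register one frame's 2D pose landmarks to a reference frame. The 33-landmark MediaPipe layout comes from the abstract, but the NumPy formulation and array shapes are assumptions, not the authors' code.

```python
import numpy as np

def procrustes_align(landmarks, reference):
    """Align an (N, 2) landmark array to a reference set via classical
    Procrustes analysis: remove translation and scale, then solve the
    optimal rotation with an SVD."""
    X = landmarks - landmarks.mean(axis=0)   # center both point sets
    Y = reference - reference.mean(axis=0)
    X /= np.linalg.norm(X)                   # normalize overall scale
    Y /= np.linalg.norm(Y)
    U, _, Vt = np.linalg.svd(X.T @ Y)        # SVD of the cross-covariance
    R = U @ Vt                               # optimal rotation mapping X onto Y
    return X @ R

# Example: register the 33 MediaPipe landmarks of frame t to frame 0.
frame0 = np.random.rand(33, 2)
frame_t = np.random.rand(33, 2)
aligned = procrustes_align(frame_t, frame0)
```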
Existing studies of gait recognition primarily utilize sequences of either binary silhouettes or human parsing to encode the shapes and dynamics of persons during walking. Silhouettes exhibit accurate segmentation quality and robustness to environmental variations, but their low information entropy may result in sub-optimal performance. In contrast, human parsing provides fine-grained part segmentation with higher information entropy, but the segmentation quality may deteriorate in complex environments. To exploit the advantages of silhouettes and parsing while overcoming their limitations, this paper proposes a novel cross-granularity alignment gait recognition method, named XGait, to unleash the power of gait representations of different granularity. To achieve this goal, XGait first contains two branches of backbone encoders that map the silhouette sequences and the parsing sequences into two latent spaces, respectively. Moreover, to explore the complementary knowledge across the features of the two representations, we design the Global Cross-granularity Module (GCM) and the Part Cross-granularity Module (PCM) after the two encoders. In particular, the GCM aims to enhance the quality of parsing features by leveraging global features from silhouettes, while the PCM aligns the dynamics of human parts between silhouette and parsing features using the high information entropy in parsing sequences. In addition, to effectively guide the alignment of the two representations at the part level, an elaborately designed learnable division mechanism is proposed for the parsing features. Comprehensive experiments on two large-scale gait datasets not only show the superior performance of XGait, with Rank-1 accuracy of 80.5% on Gait3D and 88.3% on CCPG, but also reflect the robustness of the learned features even under challenging conditions such as occlusions and clothing changes.
https://arxiv.org/abs/2411.10742
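The GCM's internals are not spelled out in the abstract; below is a minimal sketch, under the assumption that global silhouette statistics gate the parsing channels before a 1x1 fusion, of how such a cross-granularity enhancement could look. It is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GlobalCrossGranularityModule(nn.Module):
    """Sketch of a GCM-style block: global silhouette features gate
    and enhance the (noisier) parsing features. The gating design is
    an assumption, not the paper's exact module."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),           # global silhouette context
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, parsing_feat, silhouette_feat):
        # Re-weight parsing channels with global silhouette statistics,
        # then fuse both streams with a 1x1 convolution.
        enhanced = parsing_feat * self.gate(silhouette_feat)
        return self.fuse(torch.cat([enhanced, silhouette_feat], dim=1))

# Usage: features from the two backbone branches, e.g. (B, C, H, W).
gcm = GlobalCrossGranularityModule(64)
out = gcm(torch.randn(2, 64, 16, 11), torch.randn(2, 64, 16, 11))
```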
Recently, 3D LiDAR has emerged as a promising technique in the field of gait-based person identification, serving as an alternative to traditional RGB cameras due to its robustness under varying lighting conditions and its ability to capture 3D geometric information. However, long capture distances or the use of low-cost LiDAR sensors often result in sparse human point clouds, leading to a decline in identification performance. To address these challenges, we propose a sparse-to-dense upsampling model for pedestrian point clouds in LiDAR-based gait recognition, named LidarGSU, which is designed to improve the generalization capability of existing identification models. Our method utilizes diffusion probabilistic models (DPMs), which have shown high fidelity in generative tasks such as image completion. In this work, we leverage DPMs on sparse sequential pedestrian point clouds as conditional masks in a video-to-video translation approach, applied in an inpainting manner. We conducted extensive experiments on the SUSTech1K dataset to evaluate the generative quality and recognition performance of the proposed method. Furthermore, we demonstrate the applicability of our upsampling model using a real-world dataset captured with a low-resolution sensor across varying measurement distances.
https://arxiv.org/abs/2410.08680
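For readers unfamiliar with DPMs, the sketch below shows one generic conditional denoising-diffusion training step on image-like tensors (e.g., range-image renderings of the point clouds). The linear noise schedule, the channel-concatenation conditioning, and the `model` denoiser interface are assumptions, not LidarGSU's actual formulation.

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, dense, sparse_cond, T=1000):
    """One conditional DDPM training step (generic sketch). `dense` is
    the target representation of the dense pedestrian points and
    `sparse_cond` the sparse conditioning input; both are assumed to be
    (B, C, H, W) tensors of equal shape."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    b = dense.size(0)
    t = torch.randint(0, T, (b,))
    acp = alphas_cumprod[t].view(b, 1, 1, 1)

    noise = torch.randn_like(dense)
    # Forward diffusion: corrupt the dense target at timestep t.
    x_t = acp.sqrt() * dense + (1.0 - acp).sqrt() * noise
    # The denoiser sees the noisy target concatenated with the condition.
    pred = model(torch.cat([x_t, sparse_cond], dim=1), t)
    return F.mse_loss(pred, noise)
```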
Gait recognition is a remote biometric technology that utilizes the dynamic characteristics of human movement to identify individuals even under various extreme lighting conditions. Because 2D gait representations are inherently limited in spatial perception capability, LiDAR is attractive: it can directly capture 3D gait features and represent them as point clouds, reducing environmental and lighting interference in recognition while significantly advancing privacy protection. For such complex 3D representations, shallow networks fail to achieve accurate recognition, making vision Transformers the most prevalent method. However, the prevalence of dumb patches has limited the widespread use of the Transformer architecture in gait recognition. This paper proposes a method named HorGait, which utilizes a hybrid model with a Transformer architecture for gait recognition on the planar projection of 3D point clouds from LiDAR. Specifically, it employs a hybrid model structure called the LHM Block to achieve input adaptation, long-range, and high-order spatial interaction within the Transformer architecture. Additionally, it uses large-kernel CNNs to segment the input representation, replacing attention windows to reduce dumb patches. We conducted extensive experiments, and the results show that HorGait achieves state-of-the-art performance among Transformer-based methods on the SUSTech1K dataset, verifying that the hybrid model can complete the full Transformer process and perform better on point cloud planar projections. The outstanding performance of HorGait offers new insights for future applications of the Transformer architecture in gait recognition.
https://arxiv.org/abs/2410.08454
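The abstract only names the LHM Block's goals (input adaptation, long-range and high-order spatial interaction, large-kernel convolution in place of attention windows). The block below is a loose sketch of that recipe using a gated large-kernel depthwise convolution; its structure and kernel size are assumptions, and the real LHM Block is more elaborate.

```python
import torch
import torch.nn as nn

class LHMBlockSketch(nn.Module):
    """Loose sketch of an LHM-style block: a large-kernel depthwise
    convolution provides long-range spatial mixing, and an element-wise
    gate adds a multiplicative (first-order) interaction as a simplified
    stand-in for high-order spatial interaction."""
    def __init__(self, channels, kernel_size=13):
        super().__init__()
        self.proj_in = nn.Conv2d(channels, 2 * channels, 1)
        self.dwconv = nn.Conv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)
        self.proj_out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        gate, feat = self.proj_in(x).chunk(2, dim=1)
        # Gate the large-kernel spatial context, then project back.
        return self.proj_out(gate * self.dwconv(feat)) + x
```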
Gait recognition is a rapidly progressing technique for the remote identification of individuals. Prior research predominantly employing 2D sensors to gather gait data has achieved notable advancements; nonetheless, it has unavoidably neglected the influence of 3D dynamic characteristics on recognition. Gait recognition utilizing LiDAR 3D point clouds not only directly captures 3D spatial features but also diminishes the impact of lighting conditions while ensuring privacy protection. The essence of the problem lies in how to effectively extract discriminative 3D dynamic representations from point clouds. In this paper, we propose a method named SpheriGait for extracting and enhancing dynamic features from point clouds for LiDAR-based gait recognition. Specifically, it substitutes the conventional point cloud plane projection method with spherical projection to augment the perception of dynamic features. Additionally, a network block named DAM-L is proposed to extract gait cues from the projected point cloud data. We conducted extensive experiments, and the results demonstrate that SpheriGait achieves state-of-the-art performance on the SUSTech1K dataset and verify that the spherical projection method can serve as a universal data preprocessing technique to enhance the performance of other LiDAR-based gait recognition methods, exhibiting exceptional flexibility and practicality.
https://arxiv.org/abs/2409.11869
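The core preprocessing idea, spherical rather than planar projection, can be sketched directly. The image resolution and the last-write-wins rasterization below are assumptions; the paper's exact field of view and binning are not specified here.

```python
import numpy as np

def spherical_projection(points, h=64, w=256):
    """Project an (N, 3) LiDAR point cloud onto an (h, w) range image
    using spherical coordinates (azimuth/elevation), a generic sketch
    of the projection SpheriGait substitutes for planar projection."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    azimuth = np.arctan2(y, x)                   # in [-pi, pi]
    elevation = np.arcsin(z / r)                 # in [-pi/2, pi/2]

    # Map angles to pixel coordinates.
    u = ((azimuth + np.pi) / (2 * np.pi) * (w - 1)).astype(int)
    v = ((1.0 - (elevation + np.pi / 2) / np.pi) * (h - 1)).astype(int)

    image = np.zeros((h, w), dtype=np.float32)
    image[v, u] = r                              # store range as pixel value
    return image
```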
Gait recognition has attracted increasing attention from academia and industry as a technology for recognizing humans at a distance in a non-intrusive way, without requiring cooperation. Although advanced methods have achieved impressive success in lab scenarios, most of them perform poorly in the wild. Recently, some Convolutional Neural Network (ConvNet) based methods have been proposed to address the issue of gait recognition in the wild. However, the temporal receptive field obtained by convolution operations is limited for long gait sequences. Directly replacing convolution blocks with vision transformer blocks, on the other hand, may not enhance the local temporal receptive field, which is important for covering a complete gait cycle. To address this issue, we design a Global-Local Temporal Receptive Field Network (GLGait). GLGait employs a Global-Local Temporal Module (GLTM) to establish a global-local temporal receptive field, which mainly consists of a Pseudo Global Temporal Self-Attention (PGTA) module and a temporal convolution operation. Specifically, PGTA is used to obtain a pseudo global temporal receptive field with less memory and computational complexity than multi-head self-attention (MHSA). The temporal convolution operation is used to enhance the local temporal receptive field and, moreover, aggregates the pseudo global temporal receptive field into a truly holistic one. Furthermore, we also propose a Center-Augmented Triplet Loss (CTL) in GLGait to reduce the intra-class distance and expand the positive samples in the training stage. Extensive experiments show that our method obtains state-of-the-art results on in-the-wild datasets, i.e., Gait3D and GREW. The code is available at this https URL.
https://arxiv.org/abs/2408.06834
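The exact CTL formulation is in the paper; a plausible sketch combines a batch-hard triplet term with a term that pulls embeddings toward minibatch class centers. The 0.1 weight and the center construction below are assumptions.

```python
import torch
import torch.nn.functional as F

def center_augmented_triplet(embeddings, labels, margin=0.2):
    """Hedged sketch of a center-augmented triplet loss: class centers
    computed inside the minibatch act as extra positive targets.
    `labels` is an integer tensor of shape (B,)."""
    dist = torch.cdist(embeddings, embeddings)          # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)

    # Batch-hard triplet term: hardest positive and hardest negative.
    hardest_pos = (dist * same.float()).max(dim=1).values
    hardest_neg = (dist + same.float() * 1e9).min(dim=1).values
    triplet = F.relu(hardest_pos - hardest_neg + margin).mean()

    # Center term: pull each embedding toward its class center.
    centers = torch.stack([embeddings[labels == c].mean(dim=0)
                           for c in labels.unique()])
    center_of = centers[torch.searchsorted(labels.unique(), labels)]
    center_term = (embeddings - center_of).pow(2).sum(dim=1).mean()
    return triplet + 0.1 * center_term
```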
Gait recognition with radio frequency (RF) signals enables many potential applications requiring accurate identification. However, current systems require individuals to be within a line-of-sight (LOS) environment and struggle with low signal-to-noise ratio (SNR) when signals traverse concrete and thick walls. To address these challenges, we present TRGR, a novel transmissive reconfigurable intelligent surface (RIS)-aided gait recognition system. TRGR can recognize human identities through walls using only the magnitude measurements of channel state information (CSI) from a pair of transceivers. Specifically, by leveraging transmissive RIS alongside a configuration alternating optimization algorithm, TRGR enhances wall penetration and signal quality, enabling accurate gait recognition. Furthermore, a residual convolution network (RCNN) is proposed as the backbone network to learn robust human information. Extensive experimental results show that TRGR achieves an average accuracy of 97.88% in identifying persons when signals traverse concrete walls, demonstrating the effectiveness and robustness of TRGR and highlighting the significant potential of transmissive RIS in enhancing RF-based gait recognition systems.
https://arxiv.org/abs/2407.21566
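As a hedged sketch of the RCNN backbone idea, the model below stacks basic residual blocks over CSI magnitude "images" (subcarriers x time). The layer configuration and input layout are assumptions, not the paper's exact network.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Basic 2D residual block (generic, not TRGR's exact block)."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c),
        )
    def forward(self, x):
        return torch.relu(self.body(x) + x)

class CSIGaitNet(nn.Module):
    """Small residual CNN over CSI magnitudes for identity classification."""
    def __init__(self, num_ids, in_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            ResBlock(32), nn.MaxPool2d(2),
            ResBlock(32), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_ids),
        )
    def forward(self, x):   # x: (B, 1, subcarriers, time)
        return self.net(x)
```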
Biometric recognition has primarily addressed closed-set identification, assuming all probe subjects are in the gallery. However, most practical applications involve open-set biometrics, where probe subjects may or may not be present in the gallery. This poses distinct challenges in effectively distinguishing individuals in the gallery while minimizing false detections. While it is commonly believed that powerful biometric models can excel in both closed- and open-set scenarios, existing loss functions are inconsistent with open-set evaluation. They treat genuine (mated) and imposter (non-mated) similarity scores symmetrically and neglect the relative magnitudes of imposter scores. To address these issues, we simulate open-set evaluation using minibatches during training and introduce novel loss functions: (1) the identification-detection loss optimized for open-set performance under selective thresholds and (2) relative threshold minimization to reduce the maximum negative score for each probe. Across diverse biometric tasks, including face recognition, gait recognition, and person re-identification, our experiments demonstrate the effectiveness of the proposed loss functions, significantly enhancing open-set performance while positively impacting closed-set performance. Our code and models are available at this https URL.
https://arxiv.org/abs/2407.16133
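The "relative threshold minimization" idea, reducing the maximum negative score for each probe, can be sketched as follows over a simulated open-set minibatch. The softplus surrogate and the similarity-matrix interface are assumptions; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def relative_threshold_loss(sim, labels_probe, labels_gallery):
    """Penalize each probe's largest imposter (non-mated) similarity.
    `sim` is a (P, G) similarity matrix between probe and gallery
    embeddings; probes whose row is entirely mated contribute zero."""
    mated = labels_probe.unsqueeze(1) == labels_gallery.unsqueeze(0)
    imposter = sim.masked_fill(mated, float('-inf'))
    max_neg = imposter.max(dim=1).values       # hardest imposter per probe
    return F.softplus(max_neg).mean()          # push maximum negatives down
```

Treating only the per-probe maximum, rather than all imposter scores symmetrically, mirrors the abstract's point that open-set evaluation depends on the relative magnitudes of imposter scores.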
Gait recognition is a biometric technology that recognizes the identity of humans through their walking patterns. Existing appearance-based methods utilize CNNs or Transformers to extract spatial and temporal features from silhouettes, while model-based methods employ GCNs to focus on the special topological structure of skeleton points. However, the quality of silhouettes is limited by complex occlusions, and skeletons lack the dense semantic features of the human body. To tackle these problems, we propose a novel gait recognition framework, dubbed Gait Multi-model Aggregation Network (GaitMA), which effectively combines the two modalities to obtain a more robust and comprehensive gait representation for recognition. First, skeletons are represented by joint/limb-based heatmaps, and features from silhouettes and skeletons are respectively extracted using two CNN-based feature extractors. Second, a co-attention alignment module is proposed to align the features by element-wise attention. Finally, we propose a mutual learning module that achieves feature fusion through cross-attention; a Wasserstein loss is further introduced to ensure the effective fusion of the two modalities. Extensive experimental results demonstrate the superiority of our model on Gait3D, OU-MVLP, and CASIA-B.
https://arxiv.org/abs/2407.14812
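A minimal sketch of cross-attention fusion between the two token streams is shown below. The symmetric two-way design, head count, and averaged outputs are assumptions not confirmed by the abstract.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of cross-attention fusion between silhouette and
    skeleton-heatmap features, flattened to token sequences."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.sil_from_ske = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ske_from_sil = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, sil, ske):   # both (B, N, dim) token sequences
        # Each modality queries the other; the results are then averaged.
        s, _ = self.sil_from_ske(sil, ske, ske)
        k, _ = self.ske_from_sil(ske, sil, sil)
        return (s + k) / 2

fused = CrossAttentionFusion(128)(torch.randn(2, 64, 128),
                                  torch.randn(2, 64, 128))
```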
Gait recognition is a biometric technology that distinguishes individuals by their walking patterns. However, previous methods face challenges in accurately extracting identity features, which often become entangled with non-identity clues. To address this challenge, we propose CLTD, a causality-inspired discriminative feature learning module designed to effectively eliminate the influence of confounders in triple domains, i.e., spatial, temporal, and spectral. Specifically, we utilize the Cross Pixel-wise Attention Generator (CPAG) to generate attention distributions for factual and counterfactual features in the spatial and temporal domains. Then, we introduce the Fourier Projection Head (FPH) to project spatial features into the spectral space, which preserves essential information while reducing computational costs. Additionally, we employ an optimization method with contrastive learning to enforce semantic consistency constraints across sequences from the same subject. Our approach has demonstrated significant performance improvements on challenging datasets, proving its effectiveness. Moreover, it can be seamlessly integrated into existing gait recognition methods.
https://arxiv.org/abs/2407.12519
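A Fourier projection head can be sketched with a 2D FFT followed by magnitude pooling; the mean pooling and final linear projection below are assumptions about how the spectral features are condensed, not the paper's exact FPH.

```python
import torch
import torch.nn as nn

class FourierProjectionHead(nn.Module):
    """Sketch of an FPH-style head: map spatial features into the
    spectral domain with a real 2D FFT and summarize the magnitude
    spectrum per channel."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                         # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm='ortho')   # (B, C, H, W//2 + 1)
        mag = spec.abs()                          # discard phase
        pooled = mag.mean(dim=(2, 3))             # (B, C) spectral statistics
        return self.proj(pooled)

head = FourierProjectionHead(64, 32)
z = head(torch.randn(2, 64, 16, 11))
```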
Gait recognition, which aims at identifying individuals by their walking patterns, has achieved great success based on silhouettes. The binary silhouette sequence encodes the walking pattern within a sparse boundary representation. Most pixels in the silhouette are therefore under-sensitive to the walking pattern, since the sparse boundary lacks the dense spatial-temporal information that a dense texture could carry. To enhance the sensitivity to the walking pattern while maintaining the robustness of recognition, we present a Complementary Learning with neural Architecture Search (CLASH) framework, consisting of a walking-pattern-sensitive gait descriptor named the dense spatial-temporal field (DSTF) and neural architecture search based complementary learning (NCL). Specifically, DSTF transforms the representation from the sparse binary boundary into a dense distance-based texture, which is sensitive to the walking pattern at the pixel level. Further, NCL presents a task-specific search space for complementary learning, which mutually complements the sensitivity of DSTF and the robustness of the silhouette to represent the walking pattern effectively. Extensive experiments demonstrate the effectiveness of the proposed methods under both in-the-lab and in-the-wild scenarios. On CASIA-B, we achieve rank-1 accuracy of 98.8%, 96.5%, and 89.3% under the three conditions. On OU-MVLP, we achieve rank-1 accuracy of 91.9%. On the latest in-the-wild datasets, we outperform the latest silhouette-based methods by 16.3% and 19.7% on Gait3D and GREW, respectively.
https://arxiv.org/abs/2407.03632
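The transformation from a sparse binary boundary into a dense distance-based texture is, at its core, a distance transform. Below is a minimal signed-distance sketch; the signing convention is an assumption, not the paper's exact DSTF.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dense_spatial_field(silhouette):
    """Turn a binary silhouette (H, W) into a dense texture where each
    pixel stores its signed Euclidean distance to the boundary, so that
    small boundary shifts change many pixel values."""
    sil = silhouette.astype(bool)
    inside = distance_transform_edt(sil)     # distance to nearest background
    outside = distance_transform_edt(~sil)   # distance to nearest foreground
    return inside - outside                  # signed distance field

# Example: every pixel of the field now reacts to the walking pattern.
field = dense_spatial_field(np.zeros((64, 44), dtype=np.uint8))
```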
Gait recognition is a crucial biometric identification technique. Camera-based gait recognition has been widely applied in both research and industrial fields. LiDAR-based gait recognition has also begun to evolve recently, owing to its provision of 3D structural information. However, in certain applications, cameras fail to recognize persons, such as in low-light environments and long-distance recognition scenarios, where LiDARs work well. On the other hand, the deployment cost and complexity of LiDAR systems limit their wider application. Therefore, it is essential to consider cross-modality gait recognition between cameras and LiDARs for a broader range of applications. In this work, we propose the first cross-modality gait recognition framework between camera and LiDAR, namely CL-Gait. It employs a two-stream network for feature embedding of both modalities. This poses a challenging recognition task due to the inherent difficulty of matching 3D and 2D data that exhibit significant modality discrepancy. To align the feature spaces of the two modalities, i.e., camera silhouettes and LiDAR points, we propose a contrastive pre-training strategy to mitigate the modality discrepancy. To make up for the absence of paired camera-LiDAR data for pre-training, we also introduce a strategy for generating data at large scale. This strategy utilizes monocular depth estimated from single RGB images and virtual cameras to generate pseudo point clouds for contrastive pre-training. Extensive experiments show that cross-modality gait recognition is very challenging but still feasible and promising with our proposed model and pre-training strategy. To the best of our knowledge, this is the first work to address cross-modality gait recognition.
https://arxiv.org/abs/2407.02038
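The data-generation strategy hinges on back-projecting monocular depth through a virtual pinhole camera. The sketch below shows that basic operation; the intrinsics are free parameters of the virtual camera, and any filtering or sequencing the paper applies is omitted.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a monocular depth map (H, W) into a pseudo point
    cloud using pinhole intrinsics of a virtual camera."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]          # keep valid depths only

# Example: a 480x640 depth map with plausible virtual intrinsics.
cloud = depth_to_point_cloud(np.random.rand(480, 640) * 5,
                             fx=525.0, fy=525.0, cx=320.0, cy=240.0)
```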
Gait recognition is a biometric technology that identifies individuals by their walking patterns. Given the significant achievements of multimodal fusion in gait recognition, we consider employing LiDAR-camera fusion to obtain robust gait representations. However, existing methods often overlook the intrinsic characteristics of the modalities and lack fine-grained fusion and temporal modeling. In this paper, we introduce LiCAF, a novel modality-sensitive network for LiDAR-camera fusion that employs an asymmetric modeling strategy. Specifically, we propose Asymmetric Cross-modal Channel Attention (ACCA) and Interlaced Cross-modal Temporal Modeling (ICTM) for selecting valuable cross-modal channel information and for powerful temporal modeling. Our method achieves state-of-the-art performance (93.9% Rank-1 and 98.8% Rank-5) on the SUSTech1K dataset, demonstrating its effectiveness.
https://arxiv.org/abs/2406.12355
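A hedged reading of "asymmetric cross-modal channel attention" is that one modality's global statistics re-weight the other's channels in one direction only. The direction (LiDAR gating camera) and the reduction ratio below are assumptions.

```python
import torch
import torch.nn as nn

class AsymmetricChannelAttention(nn.Module):
    """Sketch of an ACCA-style idea: channel attention computed from
    one modality re-weights the channels of the other, but not vice
    versa, reflecting an asymmetric modeling strategy."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, cam_feat, lidar_feat):   # both (B, C, H, W)
        w = self.mlp(lidar_feat).unsqueeze(-1).unsqueeze(-1)
        return cam_feat * w                    # LiDAR-guided channel selection
```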
Existing deep learning methods have made significant progress in gait recognition. Typically, appearance-based models binarize inputs into silhouette sequences. However, mainstream quantization methods prioritize minimizing task loss over quantization error, which is detrimental to gait recognition with binarized inputs. Minor variations in silhouette sequences can be diminished in the network's intermediate layers due to the accumulation of quantization errors. To address this, we propose a differentiable soft quantizer, which better simulates the gradient of the round function during backpropagation. This enables the network to learn from subtle input perturbations. However, our theoretical analysis and empirical studies reveal that directly applying the soft quantizer can hinder network convergence. We further refine the training strategy to ensure convergence while simulating quantization errors. Additionally, we visualize the distribution of outputs from different samples in the feature space and observe significant changes compared to the full-precision network, which harm performance. Based on this, we propose an Inter-class Distance-guided Distillation (IDD) strategy to preserve the relative distance between the embeddings of samples with different labels. Extensive experiments validate the effectiveness of our approach, demonstrating state-of-the-art accuracy across various settings and datasets. The code will be made publicly available.
https://arxiv.org/abs/2405.13859
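One well-known differentiable surrogate for round() from the neural-compression literature is the tanh-based soft round below. It is shown only to illustrate the idea of simulating the round function's gradient; the paper's quantizer may differ.

```python
import math
import torch

def soft_round(x, alpha=5.0):
    """Tanh-based soft approximation of round(): differentiable
    everywhere and approaching hard rounding as alpha grows. At the
    half-integer boundaries it agrees exactly with round()."""
    m = torch.floor(x) + 0.5                      # nearest half-integer below
    return m + 0.5 * torch.tanh(alpha * (x - m)) / math.tanh(alpha / 2)

# With moderate alpha, gradients reflect how close each input sits to a
# rounding boundary, so subtle input perturbations still propagate.
x = torch.linspace(-1.0, 1.0, 9, requires_grad=True)
soft_round(x).sum().backward()                    # smooth, non-zero gradients
```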
Gait recognition, a rapidly advancing vision technology for person identification from a distance, has made significant strides in indoor settings. However, evidence suggests that existing methods often yield unsatisfactory results when applied to newly released real-world gait datasets. Furthermore, conclusions drawn from indoor gait datasets may not easily generalize to outdoor ones. Therefore, the primary goal of this work is to present a comprehensive benchmark study aimed at improving practicality rather than solely focusing on enhancing performance. To this end, we first develop OpenGait, a flexible and efficient gait recognition platform. Using OpenGait as a foundation, we conduct in-depth ablation experiments to revisit recent developments in gait recognition. Surprisingly, we detect imperfections in certain prior methods, thereby arriving at several critical yet previously undiscovered insights. Inspired by these findings, we develop three structurally simple yet empirically powerful and practically robust baseline models, i.e., DeepGaitV2, SkeletonGait, and SkeletonGait++, respectively representing the appearance-based, model-based, and multi-modal methodologies for gait pattern description. Beyond achieving SoTA performances, more importantly, our careful exploration sheds new light on the modeling experience of deep gait models, the representational capacity of typical gait modalities, and so on. We hope this work can inspire further research and application of gait recognition towards better practicality. The code is available at this https URL.
https://arxiv.org/abs/2405.09138
Surveillance footage represents a valuable resource and opportunities for conducting gait analysis. However, the typical low quality and high noise levels in such footage can severely impact the accuracy of pose estimation algorithms, which are foundational for reliable gait analysis. Existing literature suggests a direct correlation between the efficacy of pose estimation and the subsequent gait analysis results. A common mitigation strategy involves fine-tuning pose estimation models on noisy data to improve robustness. However, this approach may degrade the downstream model's performance on the original high-quality data, leading to a trade-off that is undesirable in practice. We propose a processing pipeline that incorporates a task-targeted artifact correction model specifically designed to pre-process and enhance surveillance footage before pose estimation. Our artifact correction model is optimized to work alongside a state-of-the-art pose estimation network, HRNet, without requiring repeated fine-tuning of the pose estimation model. Furthermore, we propose a simple and robust method for obtaining low quality videos that are annotated with poses in an automatic manner with the purpose of training the artifact correction model. We systematically evaluate the performance of our artifact correction model against a range of noisy surveillance data and demonstrate that our approach not only achieves improved pose estimation on low-quality surveillance footage, but also preserves the integrity of the pose estimation on high resolution footage. Our experiments show a clear enhancement in gait analysis performance, supporting the viability of the proposed method as a superior alternative to direct fine-tuning strategies. Our contributions pave the way for more reliable gait analysis using surveillance data in real-world applications, regardless of data quality.
https://arxiv.org/abs/2404.12183
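The automatic-annotation recipe can be sketched simply: run the pose estimator on clean footage to obtain labels, then synthetically degrade the same frames to create aligned low-quality inputs for training the artifact correction model. The specific degradations below (Gaussian noise plus JPEG compression via OpenCV) are assumptions, not the paper's exact recipe.

```python
import numpy as np
import cv2

def make_degraded_pair(frame, quality=15, noise_sigma=8.0):
    """Create an aligned (degraded input, clean target) training pair
    from a clean uint8 BGR frame. Pose labels obtained on the clean
    frame transfer directly, since the degradation is pixel-aligned."""
    noisy = frame.astype(np.float32) + np.random.randn(*frame.shape) * noise_sigma
    noisy = np.clip(noisy, 0, 255).astype(np.uint8)
    ok, buf = cv2.imencode('.jpg', noisy, [cv2.IMWRITE_JPEG_QUALITY, quality])
    degraded = cv2.imdecode(buf, cv2.IMREAD_COLOR)  # compression artifacts
    return degraded, frame                          # (input, restoration target)
```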
Gait is a behavioral biometric modality that can be used to recognize individuals by the way they walk from a far distance. Most existing gait recognition approaches rely on either silhouettes or skeletons, while their joint use is underexplored. Features from silhouettes and skeletons can provide complementary information for more robust recognition against appearance changes or pose estimation errors. To exploit the benefits of both silhouette and skeleton features, we propose a new gait recognition network, referred to as the GaitPoint+. Our approach models skeleton key points as a 3D point cloud, and employs a computational complexity-conscious 3D point processing approach to extract skeleton features, which are then combined with silhouette features for improved accuracy. Since silhouette- or CNN-based methods already require considerable amount of computational resources, it is preferable that the key point learning module is faster and more lightweight. We present a detailed analysis of the utilization of every human key point after the use of traditional max-pooling, and show that while elbow and ankle points are used most commonly, many useful points are discarded by max-pooling. Thus, we present a method to recycle some of the discarded points by a Recycling Max-Pooling module, during processing of skeleton point clouds, and achieve further performance improvement. We provide a comprehensive set of experimental results showing that (i) incorporating skeleton features obtained by a point-based 3D point cloud processing approach boosts the performance of three different state-of-the-art silhouette- and CNN-based baselines; (ii) recycling the discarded points increases the accuracy further. Ablation studies are also provided to show the effectiveness and contribution of different components of our approach.
https://arxiv.org/abs/2404.10213
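The recycling principle, pooling again over the points that the first max-pooling discarded, can be sketched as below. The paper's Recycling Max-Pooling module is likely more refined; this shows only the core idea.

```python
import torch

def recycling_max_pool(features):
    """Sketch of recycling max-pooling over per-point features of shape
    (B, C, N) with N >= 2: take the standard channel-wise max, mask out
    the winning points, pool the remainder, and concatenate."""
    first, idx = features.max(dim=2)                 # (B, C) and winner indices
    mask = torch.zeros_like(features, dtype=torch.bool)
    mask.scatter_(2, idx.unsqueeze(2), True)         # mark winners per channel
    second = features.masked_fill(mask, float('-inf')).max(dim=2).values
    return torch.cat([first, second], dim=1)         # (B, 2C) pooled descriptor

pooled = recycling_max_pool(torch.randn(2, 64, 128))
```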
Current gait recognition research mainly focuses on identifying pedestrians captured by the same type of sensor, neglecting the fact that individuals may be captured by different sensors in order to adapt to various environments. A more practical approach should involve cross-modality matching across different sensors. Hence, this paper focuses on investigating the problem of cross-modality gait recognition, with the objective of accurately identifying pedestrians across diverse vision sensors. We present CrossGait, inspired by the feature alignment strategy, which is capable of cross-retrieving diverse data modalities. Specifically, we tackle the cross-modality recognition task by initially extracting features within each modality and subsequently aligning these features across modalities. To further enhance the cross-modality performance, we propose a Prototypical Modality-shared Attention Module that learns modality-shared features from two modality-specific features. Additionally, we design a Cross-modality Feature Adapter that transforms the learned modality-specific features into a unified feature space. Extensive experiments conducted on the SUSTech1K dataset demonstrate the effectiveness of CrossGait: (1) it exhibits promising cross-modality ability in retrieving pedestrians across various modalities from different sensors in diverse scenes, and (2) CrossGait not only learns modality-shared features for cross-modality gait recognition but also maintains modality-specific features for single-modality recognition.
https://arxiv.org/abs/2404.04120
Gait recognition aims to identify a person based on their walking sequences, serving as a useful biometric modality as it can be observed from long distances without requiring cooperation from the subject. In representing a person's walking sequence, silhouettes and skeletons are the two primary modalities used. Silhouette sequences lack detailed part information when overlapping occurs between different body segments and are affected by carried objects and clothing. Skeletons, comprising joints and the bones connecting them, provide more accurate part information for different segments; however, they are sensitive to occlusions and low-quality images, causing inconsistencies in frame-wise results within a sequence. In this paper, we explore the use of a two-stream representation of skeletons for gait recognition, alongside silhouettes. By fusing the combined data of silhouettes and skeletons, we refine the two skeleton streams, joints and bones, through self-correction in graph convolution, along with cross-modal correction enforcing temporal consistency from silhouettes. We demonstrate that with refined skeletons, the performance of the gait recognition model can achieve further improvement on public gait recognition datasets compared with state-of-the-art methods, without extra annotations.
https://arxiv.org/abs/2404.02345
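A minimal sketch of self-correction via graph convolution over the skeleton is given below, predicting refined joints as residual offsets propagated along bones. The adjacency normalization and residual design are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class SkeletonRefineGCN(nn.Module):
    """Sketch of skeleton self-correction: a graph convolution over the
    joint adjacency propagates information along bones, and refined
    joints are predicted as residual offsets."""
    def __init__(self, adjacency, in_dim=2, hidden=64):
        super().__init__()
        A = adjacency + torch.eye(adjacency.size(0))       # add self-loops
        self.register_buffer('A_hat', A / A.sum(dim=1, keepdim=True))
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, in_dim)

    def forward(self, joints):                  # joints: (B, J, in_dim)
        h = torch.relu(self.A_hat @ self.fc1(joints))
        return joints + self.fc2(self.A_hat @ h)   # residual correction

# Example with a toy 3-joint chain (0-1, 1-2).
A = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
refined = SkeletonRefineGCN(A)(torch.randn(2, 3, 2))
```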
Each person has a unique gait, i.e., walking style, that can be used as a biometric for personal identification. Recent works have demonstrated effective gait recognition using deep neural networks; however, most of these works predominantly focus on classification accuracy rather than model efficiency. In order to perform gait recognition using wearable devices on the edge, it is imperative to develop highly efficient low-power models that can be deployed on small form-factor devices such as microcontrollers. In this paper, we propose a small four-layer CNN model that is well suited to edge AI deployment and real-time gait recognition. This model was trained on a public gait dataset with 20 classes, augmented with data collected by the authors, for 24 classes in total. Our model achieves 96.7% accuracy and consumes only 5 KB of RAM, with an inference time of 70 ms and 125 mW power, while running continuous inference on an Arduino Nano 33 BLE Sense. We successfully demonstrated real-time identification of the authors with the model running on the Arduino, underscoring its efficacy and providing a proof of feasibility for deployment in practical systems in the near future.
https://arxiv.org/abs/2404.15312
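A microcontroller-scale model in the spirit of the paper might look like the four-layer network below. The channel counts, 128-sample window, and accelerometer input are assumptions sized to fit a few KB of RAM after int8 quantization, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TinyGaitCNN(nn.Module):
    """Four-layer (two conv, two linear) CNN over a window of IMU
    samples, small enough for microcontroller deployment once
    quantized and converted."""
    def __init__(self, num_classes=24, in_ch=3, window=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_ch, 8, 5, padding=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(8, 16, 5, padding=2), nn.ReLU(), nn.MaxPool1d(4),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * (window // 16), 32), nn.ReLU(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):        # x: (B, 3, window) accelerometer window
        return self.classifier(self.features(x))

logits = TinyGaitCNN()(torch.randn(1, 3, 128))
```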