Gait recognition offers a non-intrusive biometric solution by identifying individuals through their walking patterns. Although discriminative models have achieved notable success in this domain, the full potential of generative models remains largely underexplored. In this paper, we introduce \textbf{CoD$^2$}, a novel framework that combines the data distribution modeling capabilities of diffusion models with the semantic representation learning strengths of discriminative models to extract robust gait features. We propose a Multi-level Conditional Control strategy that incorporates both high-level identity-aware semantic conditions and low-level visual details. Specifically, the high-level condition, extracted by the discriminative extractor, guides the generation of identity-consistent gait sequences, whereas low-level visual details, such as appearance and motion, are preserved to enhance consistency. Furthermore, the generated sequences facilitate the discriminative extractor's learning, enabling it to capture more comprehensive high-level semantic features. Extensive experiments on four datasets (SUSTech1K, CCPG, GREW, and Gait3D) demonstrate that CoD$^2$ achieves state-of-the-art performance and can be seamlessly integrated with existing discriminative methods, yielding consistent improvements.
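As a rough illustration of the multi-level conditioning idea (not the paper's actual architecture), the sketch below modulates a toy noise predictor with a high-level identity embedding while concatenating a low-level visual condition at the input; all module names, dimensions, and the FiLM-style mechanism are assumptions.

```python
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Toy noise predictor: the low-level visual condition is concatenated with the
    noisy frame, and the high-level identity embedding plus timestep modulate the
    features FiLM-style (scale and shift)."""
    def __init__(self, id_dim=256, base=32):
        super().__init__()
        self.enc = nn.Conv2d(2, base, 3, padding=1)    # noisy frame + low-level condition
        self.film = nn.Linear(id_dim + 1, 2 * base)    # identity + timestep -> scale, shift
        self.dec = nn.Conv2d(base, 1, 3, padding=1)

    def forward(self, x_t, t, id_embed, low_level):
        h = self.enc(torch.cat([x_t, low_level], dim=1))
        scale, shift = self.film(torch.cat([id_embed, t[:, None]], dim=1)).chunk(2, dim=1)
        h = torch.relu(h * (1 + scale[..., None, None]) + shift[..., None, None])
        return self.dec(h)

denoiser = ConditionedDenoiser()
x_t = torch.randn(4, 1, 64, 44)        # noisy silhouette frames
t = torch.rand(4)                      # diffusion timesteps in [0, 1]
id_embed = torch.randn(4, 256)         # high-level condition from a discriminative extractor
low_level = torch.rand(4, 1, 64, 44)   # low-level visual condition (e.g., a reference frame)
eps_hat = denoiser(x_t, t, id_embed, low_level)   # (4, 1, 64, 44)
```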
https://arxiv.org/abs/2511.06245
Deep learning-based gait recognition has achieved great success in various applications. The key to accurate gait recognition lies in considering the unique and diverse behavior patterns in different motion regions, especially when covariates affect visual appearance. However, existing methods typically use predefined regions for temporal modeling, with fixed or equivalent temporal scales assigned to different types of regions, which makes it difficult to model motion regions that change dynamically over time and adapt to their specific patterns. To tackle this problem, we introduce a Region-aware Dynamic Aggregation and Excitation framework (GaitRDAE) that automatically searches for motion regions, assigns adaptive temporal scales and applies corresponding attention. Specifically, the framework includes two core modules: the Region-aware Dynamic Aggregation (RDA) module, which dynamically searches the optimal temporal receptive field for each region, and the Region-aware Dynamic Excitation (RDE) module, which emphasizes the learning of motion regions containing more stable behavior patterns while suppressing attention to static regions that are more susceptible to covariates. Experimental results show that GaitRDAE achieves state-of-the-art performance on several benchmark datasets.
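A minimal sketch of the RDA/RDE intuition, assuming PyTorch and hypothetical shapes: each region mixes several temporal receptive fields with input-dependent weights, and a simple variance-based gate stands in for motion-aware excitation. It is only an illustration, not the paper's modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionDynamicAggregation(nn.Module):
    """Mixes several temporal receptive fields per horizontal region with
    learned, input-dependent weights (a rough stand-in for the RDA idea)."""
    def __init__(self, channels, scales=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(channels, channels, k, padding=k // 2) for k in scales])
        self.router = nn.Linear(channels, len(scales))

    def forward(self, x):                       # x: (B, R, C, T) regions over time
        B, R, C, T = x.shape
        x = x.reshape(B * R, C, T)
        weights = F.softmax(self.router(x.mean(dim=-1)), dim=-1)   # (B*R, S)
        out = sum(w[:, None, None] * b(x) for w, b in zip(weights.unbind(-1), self.branches))
        return out.reshape(B, R, C, T)

def region_excitation(x):
    """Emphasize regions whose features vary over time (motion) and damp
    near-static ones; a simple variance-based gate, not the paper's RDE."""
    motion = x.var(dim=-1, keepdim=True).mean(dim=2, keepdim=True)  # (B, R, 1, 1)
    return x * torch.sigmoid(motion)

feat = torch.randn(2, 8, 64, 30)        # batch, regions, channels, frames
out = region_excitation(RegionDynamicAggregation(64)(feat))
```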
https://arxiv.org/abs/2510.16541
Gait recognition is an important biometric for human identification at a distance, particularly under low-resolution or unconstrained environments. Current works typically focus on either 2D representations (e.g., silhouettes and skeletons) or 3D representations (e.g., meshes and SMPLs), but relying on a single modality often fails to capture the full geometric and dynamic complexity of human walking patterns. In this paper, we propose a multi-modal and multi-task framework that combines 2D temporal silhouettes with 3D SMPL features for robust gait analysis. Beyond identification, we introduce a multitask learning strategy that jointly performs gait recognition and human attribute estimation, including age, body mass index (BMI), and gender. A unified transformer is employed to effectively fuse multi-modal gait features and better learn attribute-related representations, while preserving discriminative identity cues. Extensive experiments on the large-scale BRIAR datasets, collected under challenging conditions such as long-range distances (up to 1 km) and extreme pitch angles (up to 50°), demonstrate that our approach outperforms state-of-the-art methods in gait recognition and provides accurate human attribute estimation. These results highlight the promise of multi-modal and multitask learning for advancing gait-based human understanding in real-world scenarios.
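To make the fusion and multi-task setup concrete, here is a hedged sketch of a unified transformer that consumes per-frame silhouette and SMPL features and attaches identity, age, BMI, and gender heads; all dimensions and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiModalMultiTaskGait(nn.Module):
    """Fuses per-frame silhouette and SMPL features with a transformer encoder and
    attaches identity plus attribute heads; dimensions are illustrative."""
    def __init__(self, sil_dim=256, smpl_dim=85, d_model=256, num_ids=500):
        super().__init__()
        self.sil_proj = nn.Linear(sil_dim, d_model)
        self.smpl_proj = nn.Linear(smpl_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=2)
        self.id_head = nn.Linear(d_model, num_ids)
        self.age_head = nn.Linear(d_model, 1)
        self.bmi_head = nn.Linear(d_model, 1)
        self.gender_head = nn.Linear(d_model, 2)

    def forward(self, sil_feats, smpl_feats):   # (B, T, sil_dim), (B, T, smpl_dim)
        tokens = torch.cat([self.sil_proj(sil_feats), self.smpl_proj(smpl_feats)], dim=1)
        fused = self.fuser(tokens).mean(dim=1)  # temporal + modality pooling
        return (self.id_head(fused), self.age_head(fused),
                self.bmi_head(fused), self.gender_head(fused))

model = MultiModalMultiTaskGait()
logits_id, age, bmi, gender = model(torch.randn(2, 30, 256), torch.randn(2, 30, 85))
```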
https://arxiv.org/abs/2510.10417
Gait recognition, a fundamental biometric technology, leverages unique walking patterns for individual identification, typically using 2D representations such as silhouettes or skeletons. However, these methods often struggle with viewpoint variations, occlusions, and noise. Multi-modal approaches that incorporate 3D body shape information offer improved robustness but are computationally expensive, limiting their feasibility for real-time applications. To address these challenges, we introduce Mesh-Gait, a novel end-to-end multi-modal gait recognition framework that directly reconstructs 3D representations from 2D silhouettes, effectively combining the strengths of both modalities. In existing methods, directly learning 3D features from 3D joints or meshes is complex, and such features are difficult to fuse with silhouette-based gait features. To overcome this, Mesh-Gait reconstructs 3D heatmaps as an intermediate representation, enabling the model to effectively capture 3D geometric information while maintaining simplicity and computational efficiency. During training, the intermediate 3D heatmaps are gradually reconstructed and become increasingly accurate under supervised learning, where losses are computed between the reconstructed 3D joints, virtual markers, and 3D meshes and their corresponding ground truth, ensuring precise spatial alignment and consistent 3D structure. Mesh-Gait extracts discriminative features from both silhouettes and reconstructed 3D heatmaps in a computationally efficient manner. This design enables the model to capture spatial and structural gait characteristics while avoiding the heavy overhead of direct 3D reconstruction from RGB videos, allowing the network to focus on motion dynamics rather than irrelevant visual details. Extensive experiments demonstrate that Mesh-Gait achieves state-of-the-art accuracy. The code will be released upon acceptance of the paper.
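A minimal sketch of the kind of composite supervision described above, assuming PyTorch; the dictionary keys, loss weights, and L1 distance are hypothetical stand-ins for the paper's actual loss terms.

```python
import torch
import torch.nn.functional as F

def mesh_gait_supervision(pred, gt, w_joint=1.0, w_marker=1.0, w_mesh=1.0):
    """Illustrative composite loss: reconstructed 3D joints, virtual markers, and
    mesh vertices are each regressed to ground truth (names are hypothetical)."""
    return (w_joint * F.l1_loss(pred["joints"], gt["joints"]) +
            w_marker * F.l1_loss(pred["markers"], gt["markers"]) +
            w_mesh * F.l1_loss(pred["vertices"], gt["vertices"]))

pred = {k: torch.randn(2, n, 3) for k, n in [("joints", 24), ("markers", 64), ("vertices", 6890)]}
gt = {k: v + 0.01 * torch.randn_like(v) for k, v in pred.items()}
loss = mesh_gait_supervision(pred, gt)
```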
https://arxiv.org/abs/2510.10406
Gait recognition is a valuable biometric task that enables the identification of individuals from a distance based on their walking patterns. However, it remains limited by the lack of large-scale labeled datasets and the difficulty of collecting diverse gait samples for each individual while preserving privacy. To address these challenges, we propose GaitCrafter, a diffusion-based framework for synthesizing realistic gait sequences in the silhouette domain. Unlike prior works that rely on simulated environments or alternative generative models, GaitCrafter trains a video diffusion model from scratch, exclusively on gait silhouette data. Our approach enables the generation of temporally consistent and identity-preserving gait sequences. Moreover, the generation process is controllable, allowing conditioning on various covariates such as clothing, carried objects, and view angle. We show that incorporating synthetic samples generated by GaitCrafter into the gait recognition pipeline leads to improved performance, especially under challenging conditions. Additionally, we introduce a mechanism to generate novel identities, i.e., synthetic individuals not present in the original dataset, by interpolating identity embeddings. These novel identities exhibit unique, consistent gait patterns and are useful for training models while maintaining privacy of real subjects. Overall, our work takes an important step toward leveraging diffusion models for high-quality, controllable, and privacy-aware gait data generation.
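The novel-identity mechanism can be illustrated with a small sketch: interpolate two real identity embeddings and renormalize, then use the result to condition the sampler. The embedding dimension and normalization are assumptions.

```python
import torch

def interpolate_identities(emb_a, emb_b, alpha=0.5):
    """Create a synthetic identity embedding by linear interpolation between two
    real identities, then renormalize; a simple sketch of the idea."""
    mixed = (1 - alpha) * emb_a + alpha * emb_b
    return mixed / mixed.norm(dim=-1, keepdim=True)

emb_a, emb_b = torch.randn(256), torch.randn(256)
novel_id = interpolate_identities(emb_a, emb_b, alpha=0.3)
# `novel_id` would then condition the diffusion sampler to synthesize sequences
# of a person that does not exist in the training set.
```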
https://arxiv.org/abs/2508.13300
Recent advancements in gait recognition have significantly enhanced performance by treating silhouettes as either an unordered set or an ordered sequence. However, both set-based and sequence-based approaches exhibit notable limitations. Specifically, set-based methods tend to overlook short-range temporal context for individual frames, while sequence-based methods struggle to capture long-range temporal dependencies effectively. To address these challenges, we draw inspiration from human identification and propose a new perspective that conceptualizes human gait as a composition of individualized actions. Each action is represented by a series of frames, randomly selected from a continuous segment of the sequence, which we term a snippet. Fundamentally, the collection of snippets for a given sequence enables the incorporation of multi-scale temporal context, facilitating more comprehensive gait feature learning. Moreover, we introduce a non-trivial solution for snippet-based gait recognition, focusing on Snippet Sampling and Snippet Modeling as key components. Extensive experiments on four widely-used gait datasets validate the effectiveness of our proposed approach and, more importantly, highlight the potential of gait snippets. For instance, our method achieves the rank-1 accuracy of 77.5% on Gait3D and 81.7% on GREW using a 2D convolution-based backbone.
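A toy sampler illustrating the snippet idea, under the assumption that a snippet is a set of frames drawn at random from one contiguous segment of the sequence; names and default values are illustrative.

```python
import random

def sample_snippets(seq_len, num_snippets=4, frames_per_snippet=8):
    """Split a sequence into contiguous segments and randomly draw frames inside
    each segment, yielding multi-scale temporal context; an illustrative sampler."""
    bounds = [round(i * seq_len / num_snippets) for i in range(num_snippets + 1)]
    snippets = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        pool = list(range(lo, hi))
        # sample with replacement if the segment is shorter than the snippet
        if len(pool) < frames_per_snippet:
            picks = sorted(random.choices(pool, k=frames_per_snippet))
        else:
            picks = sorted(random.sample(pool, frames_per_snippet))
        snippets.append(picks)
    return snippets

print(sample_snippets(seq_len=60))   # e.g. 4 snippets of 8 frame indices each
```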
https://arxiv.org/abs/2508.07782
Distributed Acoustic Sensing (DAS) technology finds growing applications across various domains. However, data distribution disparities due to heterogeneous sensing environments pose challenges for data-driven artificial intelligence (AI) models, limiting cross-domain generalization, a problem compounded by a shortage of labeled training data. To address these issues, this study proposes a foundational model for DAS signal recognition based on a Masked Autoencoder, named MAEPD. The MAEPD model is pretrained on a dataset of 635,860 samples, encompassing DAS gait spatiotemporal signals, 2D GASF images for perimeter security, 2D time-frequency images for pipeline leakage, and open-dataset signals including whale vocalizations and seismic activities, using a self-supervised mask reconstruction task to capture deep semantic features of DAS signals. Visual Prompt Tuning (VPT) is employed for downstream recognition tasks. This method freezes the pretrained backbone parameters and fine-tunes only a small set of learnable visual prompt vectors inserted into the Transformer encoder layers. Experiments on the NVIDIA GeForce RTX 4080 Super platform validate MAEPD using indoor gait recognition as a downstream task. The VPT-Deep approach achieves a classification accuracy of 96.94% with just 0.322% of parameters fine-tuned, surpassing the traditional Full Fine Tuning (FFT) method by 0.61% and reducing training time by 45%. The model also exhibits robust performance in pipeline leakage detection, confirming the generality, efficiency, and scalability of MAEPD as a foundational model. This approach offers a novel paradigm for addressing the limited generalization of signal recognition models in the DAS domain.
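A hedged sketch of the VPT-Deep idea, assuming generic PyTorch transformer layers: the pretrained backbone is frozen and only a few prompt tokens, prepended at every layer, are trained. The class and dimensions are illustrative, not the MAEPD implementation.

```python
import torch
import torch.nn as nn

class VPTDeepEncoder(nn.Module):
    """Deep visual prompt tuning sketch: the pretrained transformer layers are
    frozen and a small set of prompt tokens is prepended at every layer."""
    def __init__(self, pretrained_layers, d_model=768, num_prompts=10):
        super().__init__()
        self.layers = pretrained_layers
        for p in self.layers.parameters():
            p.requires_grad = False                      # freeze the MAE backbone
        self.prompts = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(num_prompts, d_model)) for _ in self.layers])

    def forward(self, tokens):                           # tokens: (B, N, d_model)
        B = tokens.shape[0]
        for layer, prompt in zip(self.layers, self.prompts):
            x = torch.cat([prompt.expand(B, -1, -1), tokens], dim=1)
            tokens = layer(x)[:, prompt.shape[0]:]       # drop prompt tokens after each layer
        return tokens

layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(768, 12, batch_first=True) for _ in range(4)])
encoder = VPTDeepEncoder(layers)
out = encoder(torch.randn(2, 196, 768))                  # only the prompt vectors are trainable
```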
https://arxiv.org/abs/2508.04316
Robust gait recognition requires highly discriminative representations, which are closely tied to input modalities. While binary silhouettes and skeletons have dominated recent literature, these 2D representations fall short of capturing the cues needed to handle viewpoint variations and the finer, meaningful details of gait. In this paper, we introduce a novel framework, termed DepthGait, that incorporates RGB-derived depth maps and silhouettes for enhanced gait recognition. Specifically, apart from the 2D silhouette representation of the human body, the proposed pipeline explicitly estimates depth maps from a given RGB image sequence and uses them as a new modality to capture discriminative features inherent in human locomotion. In addition, a novel multi-scale and cross-level fusion scheme has also been developed to bridge the modality gap between depth maps and silhouettes. Extensive experiments on standard benchmarks demonstrate that the proposed DepthGait achieves state-of-the-art performance compared to peer methods and attains impressive mean rank-1 accuracy on challenging datasets.
https://arxiv.org/abs/2508.03397
Current gait recognition methodologies generally necessitate retraining when encountering new datasets. Nevertheless, retrained models frequently encounter difficulties in preserving knowledge from previous datasets, leading to a significant decline in performance on earlier test sets. To tackle these challenges, we present a continual gait recognition task, termed GaitAdapt, which supports the progressive enhancement of gait recognition capabilities over time and is systematically categorized according to various evaluation scenarios. Additionally, we propose GaitAdapter, a non-replay continual learning approach for gait recognition. This approach integrates the GaitPartition Adaptive Knowledge (GPAK) module, employing graph neural networks to aggregate common gait patterns from current data into a repository constructed from graph vectors. Subsequently, this repository is used to improve the discriminability of gait features in new tasks, thereby enhancing the model's ability to effectively recognize gait patterns. We also introduce a Euclidean Distance Stability Method (EDSN) based on negative pairs, which ensures that newly added gait samples from different classes maintain similar relative spatial distributions across both previous and current gait tasks, thereby alleviating the impact of task changes on the distinguishability of original domain features. Extensive evaluations demonstrate that GaitAdapter effectively retains gait knowledge acquired from diverse tasks, exhibiting markedly superior discriminative capability compared to alternative methods.
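The negative-pair stability idea can be sketched as follows, assuming access to features from the frozen previous-task model; the distance measure and weighting are illustrative, not the paper's exact EDSN formulation.

```python
import torch

def edsn_style_loss(feat_new, feat_old, labels):
    """Stability term sketch: pairwise Euclidean distances between samples of
    different classes should stay similar under the old and the new model,
    so new tasks do not distort the original feature geometry."""
    d_new = torch.cdist(feat_new, feat_new)
    d_old = torch.cdist(feat_old, feat_old)
    neg_mask = (labels[:, None] != labels[None, :]).float()
    return (neg_mask * (d_new - d_old).abs()).sum() / neg_mask.sum().clamp(min=1)

feat_new = torch.randn(8, 128)                     # current model features
feat_old = feat_new + 0.05 * torch.randn(8, 128)   # frozen previous-task model features
loss = edsn_style_loss(feat_new, feat_old, labels=torch.randint(0, 4, (8,)))
```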
https://arxiv.org/abs/2508.03375
Gait recognition aims to identify individuals based on their body shape and walking patterns. Though much progress has been achieved driven by deep learning, gait recognition in real-world surveillance scenarios remains quite challenging for current methods. Conventional approaches, which rely on periodic gait cycles and controlled environments, struggle with the non-periodic and occluded silhouette sequences encountered in the wild. In this paper, we propose a novel framework, TrackletGait, designed to address these challenges in the wild. We propose Random Tracklet Sampling, a generalization of existing sampling methods, which strikes a balance between robustness and representation in capturing diverse walking patterns. Next, we introduce Haar Wavelet-based Downsampling to preserve information during spatial downsampling. Finally, we present a Hardness Exclusion Triplet Loss, designed to exclude low-quality silhouettes by discarding hard triplet samples. TrackletGait achieves state-of-the-art results, with 77.8% and 80.4% rank-1 accuracy on the Gait3D and GREW datasets, respectively, while using only 10.3M backbone parameters. Extensive experiments are also conducted to further investigate the factors affecting gait recognition in the wild.
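A minimal sketch of a hardness-exclusion triplet loss, under the assumption that the hardest fraction of triplets is simply dropped before averaging; the margin and exclusion ratio are illustrative.

```python
import torch

def hardness_exclusion_triplet(anchor, pos, neg, margin=0.2, exclude_ratio=0.1):
    """Triplet loss sketch that drops the hardest fraction of triplets, on the
    assumption that extremely hard ones come from low-quality silhouettes."""
    d_ap = (anchor - pos).pow(2).sum(dim=1).sqrt()
    d_an = (anchor - neg).pow(2).sum(dim=1).sqrt()
    losses = torch.relu(d_ap - d_an + margin)
    keep = max(1, int(losses.numel() * (1 - exclude_ratio)))
    kept, _ = torch.topk(losses, keep, largest=False)    # discard the hardest triplets
    return kept.mean()

a, p, n = (torch.randn(32, 256) for _ in range(3))
loss = hardness_exclusion_triplet(a, p, n)
```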
https://arxiv.org/abs/2508.02143
This work was completed on a whim after discussions with my junior colleague. The motion direction angle affects the micro-Doppler spectrum width, so determining the human motion direction can provide important prior information for downstream tasks such as gait recognition. However, Doppler-Time map (DTM)-based methods still have room for improvement in achieving feature augmentation and motion determination simultaneously. In response, a low-cost but accurate radar-based human motion direction determination (HMDD) method is explored in this paper. In detail, radar-based human gait DTMs are first generated, and then feature augmentation is achieved using a feature linking model. Subsequently, the HMDD is implemented through a lightweight and fast Vision Transformer-Convolutional Neural Network hybrid model structure. The effectiveness of the proposed method is verified on an open-source dataset. The open-source code of this work is released at: this https URL.
https://arxiv.org/abs/2507.22567
Gait is becoming popular as a method of person re-identification because of its ability to identify people at a distance. However, most current works in gait recognition do not address the practical problem of occlusions. Among those which do, some require paired tuples of occluded and holistic sequences, which are impractical to collect in the real world. Further, these approaches work on occlusions but fail to retain performance on holistic inputs. To address these challenges, we propose RG-Gait, a method for residual correction for occluded gait recognition with holistic retention. We model the problem as a residual learning task, conceptualizing the occluded gait signature as a residual deviation from the holistic gait representation. Our proposed network adaptively integrates the learned residual, significantly improving performance on occluded gait sequences without compromising the holistic recognition accuracy. We evaluate our approach on the challenging Gait3D, GREW and BRIAR datasets and show that learning the residual can be an effective technique to tackle occluded gait recognition with holistic retention.
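The residual formulation can be sketched roughly as follows: a residual branch models the occlusion-induced deviation and a learned gate decides how much of it to add back, leaving holistic inputs nearly untouched. Module names and dimensions are assumptions, not the RG-Gait implementation.

```python
import torch
import torch.nn as nn

class ResidualGaitCorrection(nn.Module):
    """Sketch of holistic-plus-residual gait features: a residual branch models the
    deviation caused by occlusion and a learned gate decides how much of it to add
    back, so holistic inputs remain almost unchanged."""
    def __init__(self, dim=256):
        super().__init__()
        self.residual = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, feat):
        r = self.residual(feat)
        g = self.gate(feat)          # expected near 0 for holistic, near 1 for occluded inputs
        return feat + g * r

corrector = ResidualGaitCorrection()
corrected = corrector(torch.randn(4, 256))
```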
https://arxiv.org/abs/2507.10978
Capturing individual gait patterns while excluding identity-irrelevant cues in walking videos, such as clothing texture and color, remains a persistent challenge for vision-based gait recognition. Traditional silhouette- and pose-based methods, though theoretically effective at removing such distractions, often fall short of high accuracy due to their sparse and less informative inputs. Emerging end-to-end methods address this by directly denoising RGB videos using human priors. Building on this trend, we propose DenoisingGait, a novel gait denoising method. Inspired by the philosophy that "what I cannot create, I do not understand", we turn to generative diffusion models, uncovering how they partially filter out irrelevant factors for gait understanding. Additionally, we introduce a geometry-driven Feature Matching module, which, combined with background removal via human silhouettes, condenses the multi-channel diffusion features at each foreground pixel into a two-channel direction vector. Specifically, the proposed within- and cross-frame matching respectively capture the local vectorized structures of gait appearance and motion, producing a novel flow-like gait representation termed Gait Feature Field, which further reduces residual noise in diffusion features. Experiments on the CCPG, CASIA-B*, and SUSTech1K datasets demonstrate that DenoisingGait achieves new state-of-the-art performance in most cases for both within- and cross-domain evaluations. Code is available at this https URL.
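As an illustration of matching features into two-channel direction vectors (a rough analogue of the cross-frame case, not the paper's Feature Matching module), the sketch below picks, for each pixel, the best-matching offset within a small window by cosine similarity and returns the unit offset.

```python
import torch
import torch.nn.functional as F

def local_match_direction(feat_a, feat_b, radius=2):
    """For each pixel in feat_a, find the best-matching location in feat_b within a
    small window (by cosine similarity) and return the unit offset as a
    two-channel direction map; a rough sketch of flow-like matching."""
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    offsets, scores = [], []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = torch.roll(b, shifts=(dy, dx), dims=(2, 3))
            scores.append((a * shifted).sum(dim=1))            # (B, H, W) cosine similarity
            offsets.append((dy, dx))
    scores = torch.stack(scores, dim=1)                        # (B, K, H, W)
    best = scores.argmax(dim=1)                                # (B, H, W)
    offsets = torch.tensor(offsets, dtype=torch.float32)       # (K, 2)
    vec = offsets[best].permute(0, 3, 1, 2)                    # (B, 2, H, W)
    return F.normalize(vec, dim=1, eps=1e-6)

feat_t = torch.randn(1, 64, 32, 22)    # diffusion features of frame t
feat_t1 = torch.randn(1, 64, 32, 22)   # frame t+1: cross-frame motion matching
flow_like = local_match_direction(feat_t, feat_t1)
```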
https://arxiv.org/abs/2505.18582
Large vision model (LVM)-based gait recognition has achieved impressive performance. However, existing LVM-based approaches may overemphasize gait priors while neglecting the intrinsic value of the LVM itself, particularly the rich, distinct representations across its multiple layers. To adequately unlock the LVM's potential, this work investigates the impact of layer-wise representations on downstream recognition tasks. Our analysis reveals that the LVM's intermediate layers offer complementary properties across tasks; integrating them yields an impressive improvement even without rich, well-designed gait priors. Building on this insight, we propose a simple and universal baseline for LVM-based gait recognition, termed BiggerGait. Comprehensive evaluations on CCPG, CASIA-B*, SUSTech1K, and CCGR\_MINI validate the superiority of BiggerGait across both within- and cross-domain tasks, establishing it as a simple yet practical baseline for gait representation learning. All the models and code will be publicly available.
https://arxiv.org/abs/2505.18132
Current exoskeleton control methods often face challenges in delivering personalized treatment. Standardized walking gaits can lead to patient discomfort or even injury. Therefore, personalized gait is essential for the effectiveness of exoskeleton robots, as it directly impacts their adaptability, comfort, and rehabilitation outcomes for individual users. To enable personalized treatment in exoskeleton-assisted therapy and related applications, accurate recognition of personal gait is crucial for implementing tailored gait control. The key challenge in gait recognition lies in effectively capturing individual differences in subtle gait features caused by joint synergy, such as step frequency and step length. To tackle this issue, we propose a novel approach, which uses Multi-Scale Global Dense Graph Convolutional Networks (GCN) in the spatial domain to identify latent joint synergy patterns. Moreover, we propose a Gait Non-linear Periodic Dynamics Learning module to effectively capture the periodic characteristics of gait in the temporal domain. To support our individual gait recognition task, we have constructed a comprehensive gait dataset that ensures both completeness and reliability. Our experimental results demonstrate that our method achieves an impressive accuracy of 94.34% on this dataset, surpassing the current state-of-the-art (SOTA) by 3.77%. This advancement underscores the potential of our approach to enhance personalized gait control in exoskeleton-assisted therapy.
https://arxiv.org/abs/2505.18018
Generalized gait recognition, which aims to achieve robust performance across diverse domains, remains a challenging problem due to severe domain shifts in viewpoints, appearances, and environments. While mixed-dataset training is widely used to enhance generalization, it introduces new obstacles including inter-dataset optimization conflicts and redundant or noisy samples, both of which hinder effective representation learning. To address these challenges, we propose a unified framework that systematically improves cross-domain gait recognition. First, we design a disentangled triplet loss that isolates supervision signals across datasets, mitigating gradient conflicts during optimization. Second, we introduce a targeted dataset distillation strategy that filters out the least informative 20\% of training samples based on feature redundancy and prediction uncertainty, enhancing data efficiency. Extensive experiments on CASIA-B, OU-MVLP, Gait3D, and GREW demonstrate that our method significantly improves cross-dataset recognition for both GaitBase and DeepGaitV2 backbones, without sacrificing source-domain accuracy. Code will be released at this https URL.
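A hedged sketch of a dataset-disentangled triplet loss, assuming per-sample dataset labels: positives and negatives are restricted to the anchor's source dataset so supervision from different datasets does not collide. The batch-hard mining and margin are illustrative choices.

```python
import torch

def disentangled_triplet(features, labels, dataset_ids, margin=0.2):
    """Triplet loss sketch where positives and negatives are drawn only from the
    same source dataset as the anchor, isolating supervision across datasets."""
    dist = torch.cdist(features, features)
    same_id = labels[:, None] == labels[None, :]
    same_set = dataset_ids[:, None] == dataset_ids[None, :]
    not_self = torch.arange(len(labels)) != torch.arange(len(labels))[:, None]
    losses = []
    for i in range(features.shape[0]):
        pos = dist[i][same_id[i] & same_set[i] & not_self[i]]
        neg = dist[i][~same_id[i] & same_set[i]]
        if len(pos) and len(neg):
            losses.append(torch.relu(pos.max() - neg.min() + margin))   # batch-hard style
    return torch.stack(losses).mean() if losses else features.sum() * 0.0

feats = torch.randn(16, 128)
loss = disentangled_triplet(feats, torch.randint(0, 4, (16,)), torch.randint(0, 2, (16,)))
```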
https://arxiv.org/abs/2505.15176
Gait recognition, known for its ability to identify individuals from a distance, has gained significant attention in recent times due to its non-intrusive verification. While video-based gait identification systems perform well on large public datasets, their performance drops when applied to real-world, unconstrained gait data due to various factors. Among these, uncontrolled outdoor environments, non-overlapping camera views, varying illumination, and computational efficiency are core challenges in gait-based authentication. Currently, no dataset addresses all these challenges simultaneously. In this paper, we propose an OptiGait-LGBM model capable of performing person re-identification under these constraints using a skeletal model approach, which helps mitigate inconsistencies in a person's appearance. The model constructs a dataset from landmark positions, minimizing memory usage by using non-sequential data. A benchmark dataset, RUET-GAIT, is introduced to represent uncontrolled gait sequences in complex outdoor environments. The process involves extracting skeletal joint landmarks, generating numerical datasets, and developing an OptiGait-LGBM gait classification model. Our aim is to address the aforementioned challenges with minimal computational cost compared to existing methods. A comparative analysis with ensemble techniques such as Random Forest and CatBoost demonstrates that the proposed approach outperforms them in terms of accuracy, memory usage, and training time. This method provides a novel, low-cost, and memory-efficient video-based gait recognition solution for real-world scenarios.
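A toy stand-in for the landmark-to-classifier pipeline, assuming the lightgbm package and random features in place of real pose landmarks; it only illustrates how flattened landmark positions could feed an LGBM classifier, not the paper's actual pipeline.

```python
import numpy as np
import lightgbm as lgb

# Toy stand-in for a landmark feature table: each row flattens the (x, y)
# positions of 33 skeletal landmarks for one frame; labels are person IDs.
X = np.random.rand(600, 33 * 2)
y = np.random.randint(0, 5, size=600)

model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X[:500], y[:500])
print("accuracy:", (model.predict(X[500:]) == y[500:]).mean())
```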
https://arxiv.org/abs/2505.08801
Gait recognition has emerged as a powerful tool for unobtrusive and long-range identity analysis, with growing relevance in surveillance and monitoring applications. Although recent advances in deep learning and large-scale datasets have enabled highly accurate recognition under closed-set conditions, real-world deployment demands open-set gait enrollment, which means determining whether a new gait sample corresponds to a known identity or represents a previously unseen individual. In this work, we introduce a transformer-based framework for open-set gait enrollment that is both dataset-agnostic and recognition-architecture-agnostic. Our method leverages a SetTransformer to make enrollment decisions based on the embedding of a probe sample and a context set drawn from the gallery, without requiring task-specific thresholds or retraining for new environments. By decoupling enrollment from the main recognition pipeline, our model generalizes across different datasets, gallery sizes, and identity distributions. We propose an evaluation protocol that uses existing datasets with different ratios of identities and walks per identity. We instantiate our method using skeleton-based gait representations and evaluate it on two benchmark datasets (CASIA-B and PsyMo), using embeddings from three state-of-the-art recognition models (GaitGraph, GaitFormer, and GaitPT). We show that our method is flexible, is able to accurately perform enrollment in different scenarios, and scales better with data compared to traditional approaches. We will make the code and dataset scenarios publicly available.
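A simplified sketch of set-based enrollment (plain cross-attention instead of a full SetTransformer, so only an approximation of the described method): the probe embedding attends over a gallery context set and a binary head decides known versus new identity.

```python
import torch
import torch.nn as nn

class EnrollmentDecider(nn.Module):
    """Simplified set-based enrollment head: the probe embedding attends over a
    context set drawn from the gallery, and a binary head predicts whether the
    probe belongs to a known identity or should be enrolled as new."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, probe, context):          # probe: (B, dim), context: (B, N, dim)
        attended, _ = self.attn(probe[:, None], context, context)
        return self.head(torch.cat([probe, attended[:, 0]], dim=-1))  # known vs. new

decider = EnrollmentDecider()
logits = decider(torch.randn(2, 256), torch.randn(2, 50, 256))
```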
https://arxiv.org/abs/2505.02815
Gait recognition enables contact-free, long-range person identification that is robust to clothing variations and non-cooperative scenarios. While existing methods perform well in controlled indoor environments, they struggle with cross-vertical view scenarios, where surveillance angles vary significantly in elevation. Our experiments show up to 60\% accuracy degradation in low-to-high vertical view settings due to severe deformations and self-occlusions of key anatomical features. Current CNN and self-attention-based methods fail to effectively handle these challenges, due to their reliance on single-scale convolutions or simplistic attention mechanisms that lack effective multi-frequency feature integration. To tackle this challenge, we propose CVVNet (Cross-Vertical-View Network), a frequency aggregation architecture specifically designed for robust cross-vertical-view gait recognition. CVVNet employs a High-Low Frequency Extraction module (HLFE) that adopts parallel multi-scale convolution/max-pooling path and self-attention path as high- and low-frequency mixers for effective multi-frequency feature extraction from input silhouettes. We also introduce the Dynamic Gated Aggregation (DGA) mechanism to adaptively adjust the fusion ratio of high- and low-frequency features. The integration of our core Multi-Scale Attention Gated Aggregation (MSAGA) module, HLFE and DGA enables CVVNet to effectively handle distortions from view changes, significantly improving the recognition robustness across different vertical views. Experimental results show that our CVVNet achieves state-of-the-art performance, with $8.6\%$ improvement on DroneGait and $2\%$ on Gait3D compared with the best existing methods.
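A rough sketch of the high/low-frequency mixing and gated fusion idea, assuming PyTorch; the convolution/max-pool path, attention path, and channel gate are illustrative stand-ins for HLFE and DGA, not the CVVNet modules.

```python
import torch
import torch.nn as nn

class HighLowFrequencyBlock(nn.Module):
    """Rough HLFE/DGA-style block: a convolution + max-pool path serves as the
    high-frequency mixer, a self-attention path as the low-frequency mixer, and a
    learned channel gate blends the two."""
    def __init__(self, c=64, heads=4):
        super().__init__()
        self.high = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
                                  nn.MaxPool2d(3, stride=1, padding=1))
        self.low = nn.MultiheadAttention(c, heads, batch_first=True)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, H, W) silhouette features
        B, C, H, W = x.shape
        high = self.high(x)
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        low, _ = self.low(tokens, tokens, tokens)
        low = low.transpose(1, 2).reshape(B, C, H, W)
        g = self.gate(x)                        # dynamic high/low fusion ratio
        return g * high + (1 - g) * low

block = HighLowFrequencyBlock()
out = block(torch.randn(2, 64, 32, 22))
```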
https://arxiv.org/abs/2505.01837
In this paper, we propose H-MoRe, a novel pipeline for learning precise human-centric motion representation. Our approach dynamically preserves relevant human motion while filtering out background movement. Notably, unlike previous methods relying on fully supervised learning from synthetic data, H-MoRe learns directly from real-world scenarios in a self-supervised manner, incorporating both human pose and body shape information. Inspired by kinematics, H-MoRe represents absolute and relative movements of each body point in a matrix format that captures nuanced motion details, termed world-local flows. H-MoRe offers refined insights into human motion, which can be integrated seamlessly into various action-related applications. Experimental results demonstrate that H-MoRe brings substantial improvements across various downstream tasks, including gait recognition (CL@R1: +16.01%), action recognition (Acc@1: +8.92%), and video generation (FVD: -67.07%). Additionally, H-MoRe exhibits high inference efficiency (34 fps), making it suitable for most real-time scenarios. Models and code will be released upon publication.
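A toy version of a world-local flow on sparse 2D body points, assuming the absolute displacement between frames as the "world" part and displacement relative to a root point as the "local" part; the paper's representation is dense and matrix-valued, so this is only illustrative.

```python
import torch

def world_local_flow(points_t, points_t1, root_index=0):
    """Toy world-local flow: absolute displacement of each body point between
    frames, plus its displacement relative to a root point, stacked per point."""
    world = points_t1 - points_t                                    # absolute motion
    local = (points_t1 - points_t1[:, root_index:root_index + 1]) - \
            (points_t - points_t[:, root_index:root_index + 1])     # motion relative to root
    return torch.cat([world, local], dim=-1)                        # (B, P, 4)

pts_t = torch.rand(2, 17, 2)                    # 17 body points, (x, y), frame t
pts_t1 = pts_t + 0.01 * torch.randn_like(pts_t)  # frame t+1
flows = world_local_flow(pts_t, pts_t1)
```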
https://arxiv.org/abs/2504.10676