Gait recognition is emerging as a promising technology and an innovative field within computer vision. However, existing methods typically rely on complex architectures to directly extract features from images and apply pooling operations to obtain sequence-level representations. Such designs often lead to overfitting on static noise (e.g., clothing), while failing to effectively capture dynamic motion information. To address the above challenges, we present a Language-guided and Motion-aware gait recognition framework, named LangMoGR. In particular, we utilize designed gait-related language cues to capture key motion features in gait sequences.
https://arxiv.org/abs/2601.11931
The goal of gait recognition is to extract identity-invariant features of an individual under various gait conditions, e.g., cross-view and cross-clothing. Most gait models strive to implicitly learn the common traits across different gait conditions in a data-driven manner, pulling those conditions closer for recognition. However, relatively few studies have explicitly explored the inherent relations between different gait conditions. For this purpose, we attempt to establish connections among different gait conditions and propose a new perspective to achieve gait recognition: variations in different gait conditions can be approximately viewed as a combination of geometric transformations. In this case, all we need to do is determine the types of geometric transformations and achieve geometric invariance; identity invariance then naturally follows. As an initial attempt, we explore three common geometric transformations (i.e., Reflect, Rotate, and Scale) and design a $\mathcal{R}$eflect-$\mathcal{R}$otate-$\mathcal{S}$cale invariance learning framework, named ${\mathcal{RRS}}$-Gait. Specifically, it first flexibly adjusts the convolution kernel based on the specific geometric transformations to achieve approximate feature equivariance. Then these three equivariant-aware features are respectively fed into a global pooling operation for final invariance-aware learning. Extensive experiments on four popular gait datasets (Gait3D, GREW, CCPG, SUSTech1K) show superior performance across various gait conditions.
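The reflect case of the kernel-adjustment idea can be checked in a few lines. This is not the authors' implementation, just a minimal 1D NumPy sketch: cross-correlating a reflected signal with a flipped kernel yields exactly the reflection of the original response, i.e., adjusting the kernel makes the feature map reflection-equivariant.

```python
import numpy as np

def correlate(x, k):
    """Valid-mode 1D cross-correlation, the core operation of a conv layer."""
    return np.correlate(x, k, mode="valid")

# A toy 1D "silhouette row" and a small conv kernel.
rng = np.random.default_rng(0)
x = rng.random(32)
k = rng.random(5)

# Reflecting the input and convolving with the adjusted (flipped) kernel
# gives exactly the reflection of the original response:
lhs = correlate(x[::-1], k[::-1])   # adjusted kernel on reflected input
rhs = correlate(x, k)[::-1]         # reflected original response
assert np.allclose(lhs, rhs)
```

A global pooling over such a response is then identical for the original and reflected inputs, which is the invariance the framework exploits.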
https://arxiv.org/abs/2601.05604
Human motion understanding has advanced rapidly through vision-based progress in recognition, tracking, and captioning. However, most existing methods overlook physical cues such as joint actuation forces that are fundamental in biomechanics. This gap motivates our study: if and when do physically inferred forces enhance motion understanding? By incorporating forces into established motion understanding pipelines, we systematically evaluate their impact across baseline models on 3 major tasks: gait recognition, action recognition, and fine-grained video captioning. Across 8 benchmarks, incorporating forces yields consistent performance gains; for example, on CASIA-B, Rank-1 gait recognition accuracy improved from 89.52% to 90.39% (+0.87), with larger gains observed under challenging conditions: +2.7% when wearing a coat and +3.0% at the side view. On Gait3D, performance also increases from 46.0% to 47.3% (+1.3). In action recognition, CTR-GCN achieved +2.00% on Penn Action, while high-exertion classes like punching/slapping improved by +6.96%. Even in video captioning, Qwen2.5-VL's ROUGE-L score rose from 0.310 to 0.339 (+0.029), indicating that physics-inferred forces enhance temporal grounding and semantic richness. These results demonstrate that force cues can substantially complement visual and kinematic features under dynamic, occluded, or appearance-varying conditions.
https://arxiv.org/abs/2512.20451
Generative AI (GenAI) models have revolutionized animation, enabling the synthesis of humans and motion patterns with remarkable visual fidelity. However, generating truly realistic human animation remains a formidable challenge, where even minor inconsistencies can make a subject appear unnatural. This limitation is particularly critical when AI-generated videos are evaluated for behavioral biometrics, where subtle motion cues that define identity are easily lost or distorted. The present study investigates whether state-of-the-art GenAI human animation models can preserve the subtle spatio-temporal details needed for person identification through gait biometrics. Specifically, we evaluate four different GenAI models across two primary evaluation tasks to assess their ability to i) restore gait patterns from reference videos under varying conditions of complexity, and ii) transfer these gait patterns to different visual identities. Our results show that while visual quality is mostly high, biometric fidelity remains low in tasks focusing on identification, suggesting that current GenAI models struggle to disentangle identity from motion. Furthermore, through an identity transfer task, we expose a fundamental flaw in appearance-based gait recognition: when texture is disentangled from motion, identification collapses, proving current GenAI models rely on visual attributes rather than temporal dynamics.
https://arxiv.org/abs/2512.19275
Accurate and interpretable gait analysis plays a crucial role in the early detection of Parkinson's disease (PD), yet most existing approaches remain limited by single-modality inputs, low robustness, and a lack of clinical transparency. This paper presents an explainable multimodal framework that integrates RGB and Depth (RGB-D) data to recognize Parkinsonian gait patterns under realistic conditions. The proposed system employs dual YOLOv11-based encoders for modality-specific feature extraction, followed by a Multi-Scale Local-Global Extraction (MLGE) module and a Cross-Spatial Neck Fusion mechanism to enhance spatial-temporal representation. This design captures both fine-grained limb motion (e.g., reduced arm swing) and overall gait dynamics (e.g., short stride or turning difficulty), even in challenging scenarios such as low lighting or occlusion caused by clothing. To ensure interpretability, a frozen Large Language Model (LLM) is incorporated to translate fused visual embeddings and structured metadata into clinically meaningful textual explanations. Experimental evaluations on multimodal gait datasets demonstrate that the proposed RGB-D fusion framework achieves higher recognition accuracy, improved robustness to environmental variations, and clear visual-linguistic reasoning compared with single-input baselines. By combining multimodal feature learning with language-based interpretability, this study bridges the gap between visual recognition and clinical understanding, offering a novel vision-language paradigm for reliable and explainable Parkinson's disease gait analysis. Code: this https URL
https://arxiv.org/abs/2512.04425
Gait patterns play a critical role in human identification and healthcare analytics, yet current progress remains constrained by small, narrowly designed models that fail to scale or generalize. Building a unified gait foundation model requires addressing two longstanding barriers: (a) Scalability. Why have gait models historically failed to follow scaling laws? (b) Generalization. Can one model serve the diverse gait tasks that have traditionally been studied in isolation? We introduce FoundationGait, the first scalable, self-supervised pretraining framework for gait understanding. Its largest version has nearly 0.13 billion parameters and is pretrained on 12 public gait datasets comprising over 2 million walking sequences. Extensive experiments demonstrate that FoundationGait, with or without fine-tuning, performs robustly across a wide spectrum of gait datasets, conditions, tasks (e.g., human identification, scoliosis screening, depression prediction, and attribute estimation), and even input modalities. Notably, it achieves 48.0% zero-shot rank-1 accuracy on the challenging in-the-wild Gait3D dataset (1,000 test subjects) and 64.5% on the largest in-the-lab OU-MVLP dataset (5,000+ test subjects), setting a new milestone in robust gait recognition. Code and model (coming soon): this https URL.
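For readers unfamiliar with the zero-shot rank-1 metric quoted above, here is a minimal sketch. Cosine-similarity retrieval is assumed (a common choice, though the abstract does not state the exact protocol): each probe embedding retrieves its nearest gallery embedding, and a hit counts when the retrieved identity matches.

```python
import numpy as np

def rank1_accuracy(probe, probe_ids, gallery, gallery_ids):
    """Fraction of probes whose nearest gallery match shares their identity."""
    p = probe / np.linalg.norm(probe, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    nearest = (p @ g.T).argmax(axis=1)   # index of best gallery match per probe
    return float((gallery_ids[nearest] == probe_ids).mean())

# Toy check: two well-separated identities.
gallery = np.array([[1.0, 0.0], [0.0, 1.0]])
gallery_ids = np.array([7, 9])
probe = np.array([[0.9, 0.1], [0.1, 0.9]])
probe_ids = np.array([7, 9])
print(rank1_accuracy(probe, probe_ids, gallery, gallery_ids))  # 1.0
```

"Zero-shot" here means the embeddings come from the pretrained model with no fine-tuning on the target dataset; the metric itself is unchanged.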
https://arxiv.org/abs/2512.00691
Appearance-based gait recognition has achieved strong performance on controlled datasets, yet a systematic evaluation of its robustness to real-world corruptions and silhouette variability remains lacking. We present RobustGait, a framework for fine-grained robustness evaluation of appearance-based gait recognition systems. RobustGait evaluation spans four dimensions: the type of perturbation (digital, environmental, temporal, occlusion), the silhouette extraction method (segmentation and parsing networks), the architectural capacities of gait recognition models, and various deployment scenarios. The benchmark introduces 15 corruption types at 5 severity levels across CASIA-B, CCPG, and SUSTech1K, with in-the-wild validation on MEVID, and evaluates six state-of-the-art gait systems. Our evaluation surfaces several insights. First, applying noise at the RGB level better reflects real-world degradation and reveals how distortions propagate through silhouette extraction to the downstream gait recognition systems. Second, gait accuracy is highly sensitive to silhouette extractor biases, revealing an overlooked source of benchmark bias. Third, robustness depends on both the type of perturbation and the architectural design. Finally, we explore robustness-enhancing strategies, showing that noise-aware training and knowledge distillation improve performance and move toward deployment-ready systems.
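The first insight, that RGB-level noise propagates through silhouette extraction, can be illustrated with a toy pipeline. The severity-to-noise mapping and the threshold "extractor" below are hypothetical stand-ins for the benchmark's actual corruption suite and segmentation networks:

```python
import numpy as np

SIGMAS = [0.02, 0.04, 0.08, 0.12, 0.18]  # hypothetical noise levels, severity 1..5

def corrupt_rgb(frame, severity, rng):
    """Apply Gaussian noise at the RGB level (severity in 1..5)."""
    noisy = frame + rng.normal(0.0, SIGMAS[severity - 1], frame.shape)
    return np.clip(noisy, 0.0, 1.0)

def extract_silhouette(frame, thresh=0.5):
    """Toy stand-in for a segmentation network: brightness threshold."""
    return (frame.mean(axis=-1) > thresh).astype(np.uint8)

rng = np.random.default_rng(0)
frame = np.zeros((8, 8, 3))
frame[2:6, 2:6] = 0.9                    # bright "person" on a dark background
clean = extract_silhouette(frame)
noisy = extract_silhouette(corrupt_rgb(frame, severity=5, rng=rng))
# The pixel-wise disagreement quantifies how RGB corruption propagates
# into the silhouette fed to the downstream recognizer.
print(int(np.abs(clean.astype(int) - noisy.astype(int)).sum()))
```

Corrupting at the RGB level (rather than perturbing silhouettes directly) is what lets the benchmark observe this propagation.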
https://arxiv.org/abs/2511.13065
Gait recognition offers a non-intrusive biometric solution by identifying individuals through their walking patterns. Although discriminative models have achieved notable success in this domain, the full potential of generative models remains largely underexplored. In this paper, we introduce \textbf{CoD$^2$}, a novel framework that combines the data distribution modeling capabilities of diffusion models with the semantic representation learning strengths of discriminative models to extract robust gait features. We propose a Multi-level Conditional Control strategy that incorporates both high-level identity-aware semantic conditions and low-level visual details. Specifically, the high-level condition, extracted by the discriminative extractor, guides the generation of identity-consistent gait sequences, whereas low-level visual details, such as appearance and motion, are preserved to enhance consistency. Furthermore, the generated sequences facilitate the discriminative extractor's learning, enabling it to capture more comprehensive high-level semantic features. Extensive experiments on four datasets (SUSTech1K, CCPG, GREW, and Gait3D) demonstrate that CoD$^2$ achieves state-of-the-art performance and can be seamlessly integrated with existing discriminative methods, yielding consistent improvements.
https://arxiv.org/abs/2511.06245
Deep learning-based gait recognition has achieved great success in various applications. The key to accurate gait recognition lies in considering the unique and diverse behavior patterns in different motion regions, especially when covariates affect visual appearance. However, existing methods typically use predefined regions for temporal modeling, with fixed or equivalent temporal scales assigned to different types of regions, which makes it difficult to model motion regions that change dynamically over time and adapt to their specific patterns. To tackle this problem, we introduce a Region-aware Dynamic Aggregation and Excitation framework (GaitRDAE) that automatically searches for motion regions, assigns adaptive temporal scales and applies corresponding attention. Specifically, the framework includes two core modules: the Region-aware Dynamic Aggregation (RDA) module, which dynamically searches the optimal temporal receptive field for each region, and the Region-aware Dynamic Excitation (RDE) module, which emphasizes the learning of motion regions containing more stable behavior patterns while suppressing attention to static regions that are more susceptible to covariates. Experimental results show that GaitRDAE achieves state-of-the-art performance on several benchmark datasets.
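A rough sketch of "adaptive temporal scales per region": the real RDA module learns the receptive-field search end-to-end, whereas the motion-variance heuristic below is only an illustrative stand-in for the idea that dynamic regions get short windows and static regions long ones.

```python
import numpy as np

def temporal_pool(seq, window):
    """Average-pool a (T, D) region feature sequence with a sliding window."""
    T = seq.shape[0]
    return np.stack([seq[t:t + window].mean(axis=0) for t in range(T - window + 1)])

def adaptive_scale(seq, scales=(2, 4, 8), thresh=0.1):
    """Pick a short window for high-motion regions, a long one for static
    regions. Motion is proxied by frame-to-frame feature variance (a
    heuristic; RDA *learns* this choice)."""
    motion = np.diff(seq, axis=0).var()
    return scales[0] if motion > thresh else scales[-1]

rng = np.random.default_rng(1)
static_region = np.ones((12, 4)) + 0.01 * rng.standard_normal((12, 4))
moving_region = rng.standard_normal((12, 4))
print(adaptive_scale(static_region), adaptive_scale(moving_region))  # 8 2
```

Down-weighting the static region afterwards (the RDE role) would then suppress exactly the covariate-sensitive features the abstract warns about.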
https://arxiv.org/abs/2510.16541
Gait recognition is an important biometric for human identification at a distance, particularly under low-resolution or unconstrained environments. Current works typically focus on either 2D representations (e.g., silhouettes and skeletons) or 3D representations (e.g., meshes and SMPLs), but relying on a single modality often fails to capture the full geometric and dynamic complexity of human walking patterns. In this paper, we propose a multi-modal and multi-task framework that combines 2D temporal silhouettes with 3D SMPL features for robust gait analysis. Beyond identification, we introduce a multitask learning strategy that jointly performs gait recognition and human attribute estimation, including age, body mass index (BMI), and gender. A unified transformer is employed to effectively fuse multi-modal gait features and better learn attribute-related representations, while preserving discriminative identity cues. Extensive experiments on the large-scale BRIAR datasets, collected under challenging conditions such as long-range distances (up to 1 km) and extreme pitch angles (up to 50°), demonstrate that our approach outperforms state-of-the-art methods in gait recognition and provides accurate human attribute estimation. These results highlight the promise of multi-modal and multitask learning for advancing gait-based human understanding in real-world scenarios.
https://arxiv.org/abs/2510.10417
Gait recognition, a fundamental biometric technology, leverages unique walking patterns for individual identification, typically using 2D representations such as silhouettes or skeletons. However, these methods often struggle with viewpoint variations, occlusions, and noise. Multi-modal approaches that incorporate 3D body shape information offer improved robustness but are computationally expensive, limiting their feasibility for real-time applications. To address these challenges, we introduce Mesh-Gait, a novel end-to-end multi-modal gait recognition framework that directly reconstructs 3D representations from 2D silhouettes, effectively combining the strengths of both modalities. Compared to existing methods, directly learning 3D features from 3D joints or meshes is complex and difficult to fuse with silhouette-based gait features. To overcome this, Mesh-Gait reconstructs 3D heatmaps as an intermediate representation, enabling the model to effectively capture 3D geometric information while maintaining simplicity and computational efficiency. During training, the intermediate 3D heatmaps are gradually reconstructed and become increasingly accurate under supervised learning, where the loss is calculated between the reconstructed 3D joints, virtual markers, and 3D meshes and their corresponding ground truth, ensuring precise spatial alignment and consistent 3D structure. Mesh-Gait extracts discriminative features from both silhouettes and reconstructed 3D heatmaps in a computationally efficient manner. This design enables the model to capture spatial and structural gait characteristics while avoiding the heavy overhead of direct 3D reconstruction from RGB videos, allowing the network to focus on motion dynamics rather than irrelevant visual details. Extensive experiments demonstrate that Mesh-Gait achieves state-of-the-art accuracy. The code will be released upon acceptance of the paper.
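The intermediate 3D-heatmap representation can be sketched directly: each joint becomes a Gaussian volume, which is easy to supervise against ground-truth joints and to fuse with silhouette features. Grid size and bandwidth below are arbitrary illustrative choices, not Mesh-Gait's actual settings.

```python
import numpy as np

def joints_to_heatmaps(joints, grid=17, sigma=0.08):
    """Render (J, 3) joint coordinates in [0, 1]^3 as (J, G, G, G) Gaussian
    volumes -- the kind of intermediate 3D heatmap Mesh-Gait supervises."""
    axes = np.linspace(0.0, 1.0, grid)
    zz, yy, xx = np.meshgrid(axes, axes, axes, indexing="ij")
    d2 = ((xx[None] - joints[:, 0, None, None, None]) ** 2
          + (yy[None] - joints[:, 1, None, None, None]) ** 2
          + (zz[None] - joints[:, 2, None, None, None]) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

joints = np.array([[0.25, 0.5, 0.75]])   # one joint at (x, y, z)
hm = joints_to_heatmaps(joints)
# The heatmap peaks at the voxel nearest the joint.
z, y, x = np.unravel_index(hm[0].argmax(), hm[0].shape)
print(z, y, x)  # 12 8 4
```

Compared with regressing meshes directly, such dense volumes keep the 3D geometry while staying cheap to predict from 2D silhouettes.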
https://arxiv.org/abs/2510.10406
Gait recognition is a valuable biometric task that enables the identification of individuals from a distance based on their walking patterns. However, it remains limited by the lack of large-scale labeled datasets and the difficulty of collecting diverse gait samples for each individual while preserving privacy. To address these challenges, we propose GaitCrafter, a diffusion-based framework for synthesizing realistic gait sequences in the silhouette domain. Unlike prior works that rely on simulated environments or alternative generative models, GaitCrafter trains a video diffusion model from scratch, exclusively on gait silhouette data. Our approach enables the generation of temporally consistent and identity-preserving gait sequences. Moreover, the generation process is controllable-allowing conditioning on various covariates such as clothing, carried objects, and view angle. We show that incorporating synthetic samples generated by GaitCrafter into the gait recognition pipeline leads to improved performance, especially under challenging conditions. Additionally, we introduce a mechanism to generate novel identities-synthetic individuals not present in the original dataset-by interpolating identity embeddings. These novel identities exhibit unique, consistent gait patterns and are useful for training models while maintaining privacy of real subjects. Overall, our work takes an important step toward leveraging diffusion models for high-quality, controllable, and privacy-aware gait data generation.
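The novel-identity mechanism, interpolating identity embeddings, might look like the following. Linear interpolation with renormalization is an assumption on my part; the abstract does not specify the exact scheme.

```python
import numpy as np

def interpolate_identity(e_a, e_b, alpha):
    """Blend two identity embeddings into a 'novel identity' vector."""
    e = (1.0 - alpha) * e_a + alpha * e_b
    return e / np.linalg.norm(e)

rng = np.random.default_rng(0)
e_a = rng.standard_normal(128); e_a /= np.linalg.norm(e_a)
e_b = rng.standard_normal(128); e_b /= np.linalg.norm(e_b)
e_new = interpolate_identity(e_a, e_b, alpha=0.5)

# At alpha = 0.5 the synthetic identity is equidistant (in cosine
# similarity) from the two real identities it was blended from.
print(np.isclose(float(e_new @ e_a), float(e_new @ e_b)))  # True
```

Conditioning the diffusion model on `e_new` would then synthesize sequences for an identity absent from the training data, which is what enables privacy-preserving training sets.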
https://arxiv.org/abs/2508.13300
Recent advancements in gait recognition have significantly enhanced performance by treating silhouettes as either an unordered set or an ordered sequence. However, both set-based and sequence-based approaches exhibit notable limitations. Specifically, set-based methods tend to overlook short-range temporal context for individual frames, while sequence-based methods struggle to capture long-range temporal dependencies effectively. To address these challenges, we draw inspiration from human identification and propose a new perspective that conceptualizes human gait as a composition of individualized actions. Each action is represented by a series of frames, randomly selected from a continuous segment of the sequence, which we term a snippet. Fundamentally, the collection of snippets for a given sequence enables the incorporation of multi-scale temporal context, facilitating more comprehensive gait feature learning. Moreover, we introduce a non-trivial solution for snippet-based gait recognition, focusing on Snippet Sampling and Snippet Modeling as key components. Extensive experiments on four widely-used gait datasets validate the effectiveness of our proposed approach and, more importantly, highlight the potential of gait snippets. For instance, our method achieves the rank-1 accuracy of 77.5% on Gait3D and 81.7% on GREW using a 2D convolution-based backbone.
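A snippet, as defined above, is a random set of frames drawn from one contiguous segment of the sequence, and a multi-scale collection is a set of snippets with different segment lengths. A minimal sampler (segment lengths and frame counts below are illustrative, not the paper's settings):

```python
import numpy as np

def sample_snippet(seq_len, segment_len, n_frames, rng):
    """Draw one snippet: pick a contiguous segment, then randomly select
    frames from inside it."""
    start = int(rng.integers(0, seq_len - segment_len + 1))
    frames = rng.choice(np.arange(start, start + segment_len),
                        size=n_frames, replace=False)
    return np.sort(frames)

def sample_snippets(seq_len, segment_lens, n_frames, rng):
    """Multi-scale temporal context: one snippet per segment length."""
    return [sample_snippet(seq_len, s, n_frames, rng) for s in segment_lens]

rng = np.random.default_rng(0)
for snip in sample_snippets(seq_len=60, segment_lens=(8, 16, 32),
                            n_frames=4, rng=rng):
    print(snip)
```

Short segments behave like a sequence sample (short-range context), long segments like a set sample (long-range context), which is the interpolation between the two paradigms the abstract describes.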
https://arxiv.org/abs/2508.07782
Distributed Acoustic Sensing (DAS) technology finds growing applications across various domains. However, data distribution disparities due to heterogeneous sensing environments pose challenges for data-driven artificial intelligence (AI) models, limiting cross-domain generalization and facing a shortage of labeled training data. To address these issues, this study proposes a foundational model for DAS signal recognition based on a Masked Autoencoder, named MAEPD. The MAEPD model is pretrained on a dataset of 635,860 samples, encompassing DAS gait spatiotemporal signals, 2D GASF images for perimeter security, 2D time-frequency images for pipeline leakage, and open-dataset signals including whale vocalizations and seismic activities, using a self-supervised mask reconstruction task to capture deep semantic features of DAS signals. Visual Prompt Tuning (VPT) is employed for downstream recognition tasks. This method freezes the pretrained backbone parameters and fine-tunes only a small set of learnable visual prompt vectors inserted into the Transformer encoder layers. Experiments on the NVIDIA GeForce RTX 4080 Super platform validate MAEPD using indoor gait recognition as a downstream task. The VPT-Deep approach achieves a classification accuracy of 96.94% with just 0.322% of parameters fine-tuned, surpassing the traditional Full Fine Tuning (FFT) method by 0.61% and reducing training time by 45%. The model also exhibits robust performance in pipeline leakage detection, confirming the generality, efficiency, and scalability of MAEPD as a foundational model. This approach offers a novel paradigm for addressing the limited generalization of signal recognition models in the DAS domain.
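The VPT mechanism described above, freezing the backbone and training only inserted prompt vectors, reduces to two small pieces: prepending prompts to each encoder layer's token sequence, and a tiny trainable-parameter budget. The sizes below are hypothetical ViT-B-style numbers for illustration, not MAEPD's actual configuration (which reports 0.322% of parameters tuned):

```python
import numpy as np

def prepend_prompts(tokens, prompts):
    """Deep VPT, schematically: learnable prompt vectors are prepended to the
    (frozen) encoder's token sequence at each layer; only the prompts
    receive gradients during fine-tuning."""
    return np.concatenate([prompts, tokens], axis=0)

# Rough parameter-budget arithmetic (hypothetical sizes):
depth, n_prompts, dim = 12, 8, 768
prompt_params = depth * n_prompts * dim      # the only trainable part
backbone_params = 86_000_000                 # frozen ViT-B-scale backbone
print(f"tuned fraction: {prompt_params / backbone_params:.3%}")
```

Because gradients flow only into the prompts, both memory and training time drop sharply, consistent with the 45% training-time reduction the abstract reports.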
https://arxiv.org/abs/2508.04316
Robust gait recognition requires highly discriminative representations, which are closely tied to input modalities. While binary silhouettes and skeletons have dominated recent literature, these 2D representations fall short of capturing sufficient cues to handle viewpoint variations and to preserve the finer, meaningful details of gait. In this paper, we introduce a novel framework, termed DepthGait, that incorporates RGB-derived depth maps and silhouettes for enhanced gait recognition. Specifically, apart from the 2D silhouette representation of the human body, the proposed pipeline explicitly estimates depth maps from a given RGB image sequence and uses them as a new modality to capture discriminative features inherent in human locomotion. In addition, a novel multi-scale and cross-level fusion scheme has also been developed to bridge the modality gap between depth maps and silhouettes. Extensive experiments on standard benchmarks demonstrate that the proposed DepthGait achieves state-of-the-art performance compared to peer methods and attains an impressive mean rank-1 accuracy on the challenging datasets.
https://arxiv.org/abs/2508.03397
Current gait recognition methodologies generally necessitate retraining when encountering new datasets. Nevertheless, retrained models frequently encounter difficulties in preserving knowledge from previous datasets, leading to a significant decline in performance on earlier test sets. To tackle these challenges, we present a continual gait recognition task, termed GaitAdapt, which supports the progressive enhancement of gait recognition capabilities over time and is systematically categorized according to various evaluation scenarios. Additionally, we propose GaitAdapter, a non-replay continual learning approach for gait recognition. This approach integrates the GaitPartition Adaptive Knowledge (GPAK) module, employing graph neural networks to aggregate common gait patterns from current data into a repository constructed from graph vectors. Subsequently, this repository is used to improve the discriminability of gait features in new tasks, thereby enhancing the model's ability to effectively recognize gait patterns. We also introduce a Euclidean Distance Stability Method (EDSN) based on negative pairs, which ensures that newly added gait samples from different classes maintain similar relative spatial distributions across both previous and current gait tasks, thereby alleviating the impact of task changes on the distinguishability of original domain features. Extensive evaluations demonstrate that GaitAdapter effectively retains gait knowledge acquired from diverse tasks, exhibiting markedly superior discriminative capability compared to alternative methods.
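The EDSN idea, keeping negative-pair Euclidean distances stable when the feature space is updated for a new task, can be written as a simple penalty. This is a sketch of the stated principle, not the paper's exact formulation:

```python
import numpy as np

def distance_stability_loss(feats_old, feats_new, neg_pairs):
    """Penalize drift in Euclidean distances between negative pairs when
    moving from the old-task feature space to the new one."""
    i, j = neg_pairs[:, 0], neg_pairs[:, 1]
    d_old = np.linalg.norm(feats_old[i] - feats_old[j], axis=1)
    d_new = np.linalg.norm(feats_new[i] - feats_new[j], axis=1)
    return float(np.mean((d_old - d_new) ** 2))

rng = np.random.default_rng(0)
feats_old = rng.standard_normal((6, 16))
pairs = np.array([[0, 1], [2, 3], [4, 5]])   # samples from different classes
print(distance_stability_loss(feats_old, feats_old, pairs))            # 0.0
print(distance_stability_loss(feats_old, feats_old * 2.0, pairs) > 0)  # True
```

Minimizing such a term while learning the new task preserves the relative spatial layout of earlier-domain features, which is the non-replay forgetting remedy the abstract describes.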
https://arxiv.org/abs/2508.03375
Gait recognition aims to identify individuals based on their body shape and walking patterns. Though much progress has been achieved driven by deep learning, gait recognition in real-world surveillance scenarios remains quite challenging to current methods. Conventional approaches, which rely on periodic gait cycles and controlled environments, struggle with the non-periodic and occluded silhouette sequences encountered in the wild. In this paper, we propose a novel framework, TrackletGait, designed to address these challenges in the wild. We propose Random Tracklet Sampling, a generalization of existing sampling methods, which strikes a balance between robustness and representation in capturing diverse walking patterns. Next, we introduce Haar Wavelet-based Downsampling to preserve information during spatial downsampling. Finally, we present a Hardness Exclusion Triplet Loss, designed to exclude low-quality silhouettes by discarding hard triplet samples. TrackletGait achieves state-of-the-art results, with 77.8 and 80.4 rank-1 accuracy on the Gait3D and GREW datasets, respectively, while using only 10.3M backbone parameters. Extensive experiments are also conducted to further investigate the factors affecting gait recognition in the wild.
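The Hardness Exclusion Triplet Loss admits a compact sketch: compute per-triplet hinge losses, then drop the largest ones on the assumption that they stem from low-quality silhouettes rather than informative hard examples. The margin and keep ratio below are illustrative choices, not the paper's hyperparameters:

```python
import numpy as np

def hardness_exclusion_triplet_loss(anchor, positive, negative,
                                    margin=0.2, keep_ratio=0.8):
    """Triplet loss that discards the hardest (largest-loss) triplets
    before averaging, treating them as likely low-quality silhouettes."""
    d_ap = np.linalg.norm(anchor - positive, axis=1)
    d_an = np.linalg.norm(anchor - negative, axis=1)
    losses = np.maximum(d_ap - d_an + margin, 0.0)
    k = max(1, int(len(losses) * keep_ratio))
    kept = np.sort(losses)[:k]        # drop the top (1 - keep_ratio) share
    return float(kept.mean())

rng = np.random.default_rng(0)
a = rng.standard_normal((10, 8))
p = a + 0.1 * rng.standard_normal((10, 8))   # close positives
n = rng.standard_normal((10, 8))             # random negatives
full = hardness_exclusion_triplet_loss(a, p, n, keep_ratio=1.0)
trimmed = hardness_exclusion_triplet_loss(a, p, n, keep_ratio=0.8)
print(trimmed <= full)  # True: excluding hard triplets can only lower the mean
```

This inverts the usual hard-mining intuition: in the wild, the hardest triplets are often corrupted inputs, so excluding them is regularization rather than lost signal.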
https://arxiv.org/abs/2508.02143
This work was completed on a whim after discussions with my junior colleague. The motion direction angle affects the micro-Doppler spectrum width, so determining the human motion direction can provide important prior information for downstream tasks such as gait recognition. However, Doppler-Time map (DTM)-based methods still have room for improvement in achieving feature augmentation and motion determination simultaneously. In response, a low-cost but accurate radar-based human motion direction determination (HMDD) method is explored in this paper. In detail, radar-based human gait DTMs are first generated, and then feature augmentation is achieved using a feature linking model. Subsequently, the HMDD is implemented through a lightweight and fast Vision Transformer-Convolutional Neural Network hybrid model structure. The effectiveness of the proposed method is verified on an open-source dataset. The open-source code of this work is released at: this https URL.
https://arxiv.org/abs/2507.22567
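A DTM is essentially a spectrogram of the radar return, and the claimed link between motion direction and spectrum width follows from the observed Doppler scaling with cos(theta). The toy simulation below (all parameters and the leg-swing model are mine, not the paper's) generates a DTM via a sliding-window FFT and checks that a head-on walker produces a wider Doppler spread than an oblique one:

```python
import numpy as np

def doppler_time_map(sig, win=64, hop=16):
    """Doppler-Time map (DTM): magnitude of a sliding-window FFT of the
    radar return. Rows are Doppler bins (zero Doppler centered by
    fftshift), columns are time frames."""
    window = np.hanning(win)
    frames = [np.abs(np.fft.fftshift(np.fft.fft(sig[s:s + win] * window)))
              for s in range(0, len(sig) - win + 1, hop)]
    return np.stack(frames, axis=1)

def spectrum_width(dtm):
    """RMS width (in Doppler bins) of the time-averaged spectrum."""
    bins = np.arange(dtm.shape[0]) - dtm.shape[0] // 2
    p = dtm.mean(axis=1)
    p = p / p.sum()
    return float(np.sqrt(np.sum(p * bins ** 2)))

# Toy gait echo: leg-swing Doppler oscillates sinusoidally, and the
# observed shift scales with cos(theta), theta being the angle between
# the walking direction and the radar line of sight.
fs = 1000.0
t = np.arange(0, 2.0, 1 / fs)

def gait_return(theta_deg):
    f_inst = 100.0 * np.cos(np.deg2rad(theta_deg)) * np.sin(2 * np.pi * t)
    phase = 2 * np.pi * np.cumsum(f_inst) / fs  # integrate instantaneous freq
    return np.cos(phase)

dtm_head_on = doppler_time_map(gait_return(0))   # motion along line of sight
dtm_oblique = doppler_time_map(gait_return(60))  # 60 deg off: half the Doppler
```

The head-on DTM occupies roughly twice the Doppler bins of the 60-degree case, which is the prior the paper exploits for HMDD.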
Gait is becoming popular as a method of person re-identification because it can identify people at a distance. However, most current work in gait recognition does not address the practical problem of occlusion. Among the methods that do, some require paired tuples of occluded and holistic sequences, which are impractical to collect in the real world. Further, these approaches handle occlusions but fail to retain performance on holistic inputs. To address these challenges, we propose RG-Gait, a residual-correction method for occluded gait recognition with holistic retention. We model the problem as a residual learning task, conceptualizing the occluded gait signature as a residual deviation from the holistic gait representation. Our proposed network adaptively integrates the learned residual, significantly improving performance on occluded gait sequences without compromising holistic recognition accuracy. We evaluate our approach on the challenging Gait3D, GREW, and BRIAR datasets and show that learning the residual is an effective technique for occluded gait recognition with holistic retention.
https://arxiv.org/abs/2507.10978
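The "adaptively integrated residual" idea can be sketched in a few lines. This is my minimal reading of the abstract, with linear stand-ins for both networks and a single scalar gate (the paper's gating mechanism is surely richer): the occluded embedding is the holistic embedding plus a gated residual, and driving the gate toward zero on holistic inputs retains the original accuracy.

```python
import numpy as np

def backbone(x, W):
    """Stand-in for a frozen holistic gait encoder (one linear map)."""
    return x @ W

def residual_branch(x, Wr):
    """Stand-in for the residual network modeling the occlusion-induced
    deviation from the holistic representation."""
    return np.tanh(x @ Wr)

def rg_gait_embed(x, W, Wr, gate_logit):
    """Holistic feature plus an adaptively gated residual:
        f = backbone(x) + sigmoid(gate_logit) * residual(x)
    gate_logit would normally be predicted from x (e.g., an occlusion
    score); here it is passed in directly for illustration."""
    gate = 1.0 / (1.0 + np.exp(-gate_logit))  # scalar gate in (0, 1)
    return backbone(x, W) + gate * residual_branch(x, Wr)
```

With the gate shut the holistic path is untouched, which is one way "holistic retention" can hold by construction rather than by regularization.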
Capturing individual gait patterns while excluding identity-irrelevant cues in walking videos, such as clothing texture and color, remains a persistent challenge for vision-based gait recognition. Traditional silhouette- and pose-based methods, though theoretically effective at removing such distractions, often fall short of high accuracy due to their sparse and less informative inputs. Emerging end-to-end methods address this by directly denoising RGB videos using human priors. Building on this trend, we propose DenoisingGait, a novel gait denoising method. Inspired by the philosophy that "what I cannot create, I do not understand", we turn to generative diffusion models, uncovering how they partially filter out factors irrelevant to gait understanding. Additionally, we introduce a geometry-driven Feature Matching module, which, combined with background removal via human silhouettes, condenses the multi-channel diffusion features at each foreground pixel into a two-channel direction vector. Specifically, the proposed within- and cross-frame matching respectively capture the local vectorized structures of gait appearance and motion, producing a novel flow-like gait representation, termed Gait Feature Field, which further reduces residual noise in the diffusion features. Experiments on the CCPG, CASIA-B*, and SUSTech1K datasets demonstrate that DenoisingGait achieves new state-of-the-art performance in most within- and cross-domain evaluations. Code is available at this https URL.
https://arxiv.org/abs/2505.18582
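To make "condensing multi-channel features into a two-channel direction vector" concrete, here is a toy rendering of within-frame matching (the neighborhood, similarity measure, and argmax rule are my assumptions; the paper's actual matching design may differ): each foreground pixel's C-dimensional diffusion feature is compared with its 8 neighbors by cosine similarity, and the unit offset toward the best match becomes that pixel's 2-D vector.

```python
import numpy as np

def within_frame_matching(feat, mask):
    """Condense (C, H, W) diffusion features into a (2, H, W) direction
    field: for each foreground pixel, emit the unit offset toward the
    most similar of its 8 neighbors (cosine similarity)."""
    C, H, W = feat.shape
    norm = feat / (np.linalg.norm(feat, axis=0, keepdims=True) + 1e-8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    out = np.zeros((2, H, W))
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            if not mask[i, j]:
                continue  # background removed via the human silhouette
            sims = [norm[:, i, j] @ norm[:, i + di, j + dj]
                    for di, dj in offsets]
            di, dj = offsets[int(np.argmax(sims))]
            v = np.array([di, dj], dtype=float)
            out[:, i, j] = v / np.linalg.norm(v)  # unit direction vector
    return out
```

Stacking such per-pixel directions over a sequence yields a flow-like field in the spirit of the Gait Feature Field, while discarding the channel-wise appearance detail where residual clothing noise lives.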