Gait recognition is a non-intrusive biometric technique for security applications, yet existing studies are dominated by silhouette- and parsing-based representations. Silhouettes are sparse and miss internal structural details, limiting discriminability. Parsing enriches silhouettes with part-level structures, but relies heavily on upstream human parsers (e.g., label granularity and boundary precision), leading to unstable performance across datasets and sometimes even inferior results to silhouettes. We revisit gait representations from a structural perspective and describe a design space defined by edge density and supervision form: silhouettes use sparse boundary edges with weak single-label supervision, while parsing uses denser cues with strong semantic priors. In this space, we identify an underexplored paradigm: dense part-level structure without explicit semantic labels, and introduce SKETCH as a new visual modality for gait recognition. Sketch extracts high-frequency structural cues (e.g., limb articulations and self-occlusion contours) directly from RGB images via edge-based detectors in a label-free manner. We further show that label-guided parsing and label-free sketch are semantically decoupled and structurally complementary. Based on this, we propose SKETCHGAIT, a hierarchically disentangled multi-modal framework with two independent streams for modality-specific learning and a lightweight early-stage fusion branch to capture structural complementarity. Extensive experiments on SUSTech1K and CCPG validate the proposed modality and framework: SketchGait achieves 92.9% Rank-1 on SUSTech1K and 93.1% mean Rank-1 on CCPG.
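As a concrete illustration of the label-free paradigm described above, edge-based sketch extraction can be approximated with a Sobel gradient-magnitude threshold. The function below is a minimal pure-Python sketch; `extract_sketch` and its `threshold` value are illustrative stand-ins, not the paper's actual edge detector:

```python
def extract_sketch(gray, threshold=0.25):
    """Label-free sketch extraction: Sobel gradient magnitude, thresholded.

    gray: 2-D list of floats in [0, 1] (a grayscale frame from RGB video).
    Returns a binary edge map of the same size (border pixels left at 0).
    """
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Sobel responses: horizontal and vertical intensity gradients.
            gx = (gray[y-1][x+1] + 2*gray[y][x+1] + gray[y+1][x+1]
                  - gray[y-1][x-1] - 2*gray[y][x-1] - gray[y+1][x-1])
            gy = (gray[y+1][x-1] + 2*gray[y+1][x] + gray[y+1][x+1]
                  - gray[y-1][x-1] - 2*gray[y-1][x] - gray[y-1][x+1])
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges[y][x] = 1
    return edges

# A vertical step edge yields a thin column of edge pixels.
frame = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
sketch = extract_sketch(frame)
```

A real pipeline would run a learned or tuned edge detector per frame; the point here is only that the representation needs no semantic part labels.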
https://arxiv.org/abs/2603.05537
Human identification at a distance (HID) is challenging because traditional biometric modalities such as face and fingerprints are often difficult to acquire in real-world scenarios. Gait recognition provides a practical alternative, as it can be captured reliably at a distance. To promote progress in gait recognition and provide a fair evaluation platform, the International Competition on Human Identification at a Distance (HID) has been organized annually since 2020. Since 2023, the competition has adopted the challenging SUSTech-Competition dataset, which features substantial variations in clothing, carried objects, and view angles. No dedicated training data are provided, requiring participants to train their models using external datasets. Each year, the competition applies a different random seed to generate distinct evaluation splits, which reduces the risk of overfitting and supports a fair assessment of cross-domain generalization. While HID 2023 and HID 2024 already used this dataset, HID 2025 explicitly examined whether algorithmic advances could surpass the accuracy limits observed previously. Despite the heightened difficulty, participants achieved further improvements, and the best-performing method reached 94.2% accuracy, setting a new benchmark on this dataset. We also analyze key technical trends and outline potential directions for future research in gait recognition.
https://arxiv.org/abs/2602.07565
In skeleton-based human activity understanding, existing methods often adopt the contrastive learning paradigm to construct a discriminative feature space. However, many of these approaches fail to exploit the structural inter-class similarities and overlook the impact of anomalous positive samples. In this study, we introduce ACLNet, an Affinity Contrastive Learning Network that explores the intricate clustering relationships among human activity classes to improve feature discrimination. Specifically, we propose an affinity metric to refine similarity measurements, thereby forming activity superclasses that provide more informative contrastive signals. A dynamic temperature schedule is also introduced to adaptively adjust the penalty strength for various superclasses. In addition, we employ a margin-based contrastive strategy to improve the separation of hard positive and negative samples within classes. Extensive experiments on NTU RGB+D 60, NTU RGB+D 120, Kinetics-Skeleton, PKU-MMD, FineGYM, and CASIA-B demonstrate the superiority of our method in skeleton-based action recognition, gait recognition, and person re-identification. The source code is available at this https URL.
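The margin-based contrastive strategy can be sketched as an InfoNCE-style loss in which a margin tightens positive similarities and inflates negative ones, making hard samples contribute larger gradients. This is a hypothetical simplification: the fixed `temperature` argument stands in for the paper's dynamic per-superclass schedule, and the function names are illustrative:

```python
import math

def margin_contrastive_loss(anchor, positives, negatives,
                            temperature=0.1, margin=0.2):
    """Margin-based contrastive loss over dot-product similarities.

    anchor/positives/negatives: unit-norm feature vectors (lists of floats).
    The margin demands positives beat negatives by a gap before the
    softmax, which sharpens separation of hard positives and negatives.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # Shrink positive logits and inflate negative logits by the margin.
    pos_logits = [(dot(anchor, p) - margin) / temperature for p in positives]
    neg_logits = [(dot(anchor, n) + margin) / temperature for n in negatives]
    denom = sum(math.exp(l) for l in pos_logits + neg_logits)
    # Average InfoNCE-style term over all positives.
    return -sum(math.log(math.exp(l) / denom) for l in pos_logits) / len(pos_logits)

# Well-separated anchor/positive vs. orthogonal negative: small loss.
loss_small = margin_contrastive_loss([1.0, 0.0], [[1.0, 0.0]], [[0.0, 1.0]])
# A larger margin imposes a stricter penalty on the same configuration.
loss_strict = margin_contrastive_loss([1.0, 0.0], [[1.0, 0.0]], [[0.0, 1.0]],
                                      margin=0.5)
```

In the paper the temperature would additionally vary per superclass according to the affinity metric; here it is a single constant for clarity.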
https://arxiv.org/abs/2601.16694
Gait recognition is emerging as a promising technology and an innovative field within computer vision. However, existing methods typically rely on complex architectures to directly extract features from images and apply pooling operations to obtain sequence-level representations. Such designs often lead to overfitting on static noise (e.g., clothing), while failing to effectively capture dynamic motion information. To address the above challenges, we present a Language-guided and Motion-aware gait recognition framework, named LangMoGR. In particular, we utilize designed gait-related language cues to capture key motion features in gait sequences.
https://arxiv.org/abs/2601.11931
The goal of gait recognition is to extract identity-invariant features of an individual under various gait conditions, e.g., cross-view and cross-clothing. Most gait models strive to implicitly learn the common traits across different gait conditions in a data-driven manner to pull different gait conditions closer for recognition. However, relatively few studies have explicitly explored the inherent relations between different gait conditions. For this purpose, we attempt to establish connections among different gait conditions and propose a new perspective to achieve gait recognition: variations in different gait conditions can be approximately viewed as a combination of geometric transformations. In this case, all we need is to determine the types of geometric transformations and achieve geometric invariance, then identity invariance naturally follows. As an initial attempt, we explore three common geometric transformations (i.e., Reflect, Rotate, and Scale) and design a $\mathcal{R}$eflect-$\mathcal{R}$otate-$\mathcal{S}$cale invariance learning framework, named ${\mathcal{RRS}}$-Gait. Specifically, it first flexibly adjusts the convolution kernel based on the specific geometric transformations to achieve approximate feature equivariance. Then these three equivariant-aware features are respectively fed into a global pooling operation for final invariance-aware learning. Extensive experiments on four popular gait datasets (Gait3D, GREW, CCPG, SUSTech1K) show superior performance across various gait conditions.
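The invariance-through-pooling principle behind this design can be illustrated in one dimension: correlating a signal with a kernel and its reflected copy yields an equivariant pair of feature maps, and a global max over both collapses the pair into a reflection-invariant descriptor. This toy demo shows only that underlying group-pooling idea, not the paper's actual framework:

```python
def correlate(signal, kernel):
    """Valid-mode 1-D cross-correlation."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def reflect_invariant_feature(signal, kernel):
    """Pool responses of the kernel and its reflected copy, then globally.

    The union of the two response maps is (as a set) unchanged when the
    input is reflected, so the global max is a reflection-invariant
    scalar descriptor.
    """
    responses = correlate(signal, kernel) + correlate(signal, kernel[::-1])
    return max(responses)

x = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0]
k = [1.0, -1.0]  # an edge-like kernel
f = reflect_invariant_feature(x, k)
f_mirror = reflect_invariant_feature(x[::-1], k)
# f == f_mirror: the descriptor survives reflecting the input.
```

Rotation and scale would be handled analogously, by adjusting the kernel per transformation before pooling, as the framework describes.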
https://arxiv.org/abs/2601.05604
Human motion understanding has advanced rapidly through vision-based progress in recognition, tracking, and captioning. However, most existing methods overlook physical cues such as joint actuation forces that are fundamental in biomechanics. This gap motivates our study: if and when do physically inferred forces enhance motion understanding? By incorporating forces into established motion understanding pipelines, we systematically evaluate their impact across baseline models on 3 major tasks: gait recognition, action recognition, and fine-grained video captioning. Across 8 benchmarks, incorporating forces yields consistent performance gains; for example, on CASIA-B, Rank-1 gait recognition accuracy improved from 89.52% to 90.39% (+0.87), with larger gains observed under challenging conditions: +2.7% when wearing a coat and +3.0% at the side view. On Gait3D, performance also increases from 46.0% to 47.3% (+1.3). In action recognition, CTR-GCN achieved +2.00% on Penn Action, while high-exertion classes like punching/slapping improved by +6.96%. Even in video captioning, Qwen2.5-VL's ROUGE-L score rose from 0.310 to 0.339 (+0.029), indicating that physics-inferred forces enhance temporal grounding and semantic richness. These results demonstrate that force cues can substantially complement visual and kinematic features under dynamic, occluded, or appearance-varying conditions.
https://arxiv.org/abs/2512.20451
Generative AI (GenAI) models have revolutionized animation, enabling the synthesis of humans and motion patterns with remarkable visual fidelity. However, generating truly realistic human animation remains a formidable challenge, where even minor inconsistencies can make a subject appear unnatural. This limitation is particularly critical when AI-generated videos are evaluated for behavioral biometrics, where subtle motion cues that define identity are easily lost or distorted. The present study investigates whether state-of-the-art GenAI human animation models can preserve the subtle spatio-temporal details needed for person identification through gait biometrics. Specifically, we evaluate four different GenAI models across two primary evaluation tasks to assess their ability to i) restore gait patterns from reference videos under varying conditions of complexity, and ii) transfer these gait patterns to different visual identities. Our results show that while visual quality is mostly high, biometric fidelity remains low in tasks focusing on identification, suggesting that current GenAI models struggle to disentangle identity from motion. Furthermore, through an identity transfer task, we expose a fundamental flaw in appearance-based gait recognition: when texture is disentangled from motion, identification collapses, proving current GenAI models rely on visual attributes rather than temporal dynamics.
https://arxiv.org/abs/2512.19275
Accurate and interpretable gait analysis plays a crucial role in the early detection of Parkinson's disease (PD), yet most existing approaches remain limited by single-modality inputs, low robustness, and a lack of clinical transparency. This paper presents an explainable multimodal framework that integrates RGB and Depth (RGB-D) data to recognize Parkinsonian gait patterns under realistic conditions. The proposed system employs dual YOLOv11-based encoders for modality-specific feature extraction, followed by a Multi-Scale Local-Global Extraction (MLGE) module and a Cross-Spatial Neck Fusion mechanism to enhance spatial-temporal representation. This design captures both fine-grained limb motion (e.g., reduced arm swing) and overall gait dynamics (e.g., short stride or turning difficulty), even in challenging scenarios such as low lighting or occlusion caused by clothing. To ensure interpretability, a frozen Large Language Model (LLM) is incorporated to translate fused visual embeddings and structured metadata into clinically meaningful textual explanations. Experimental evaluations on multimodal gait datasets demonstrate that the proposed RGB-D fusion framework achieves higher recognition accuracy, improved robustness to environmental variations, and clear visual-linguistic reasoning compared with single-input baselines. By combining multimodal feature learning with language-based interpretability, this study bridges the gap between visual recognition and clinical understanding, offering a novel vision-language paradigm for reliable and explainable Parkinson's disease gait analysis. Code: this https URL
https://arxiv.org/abs/2512.04425
Gait patterns play a critical role in human identification and healthcare analytics, yet current progress remains constrained by small, narrowly designed models that fail to scale or generalize. Building a unified gait foundation model requires addressing two longstanding barriers: (a) Scalability. Why have gait models historically failed to follow scaling laws? (b) Generalization. Can one model serve the diverse gait tasks that have traditionally been studied in isolation? We introduce FoundationGait, the first scalable, self-supervised pretraining framework for gait understanding. Its largest version has nearly 0.13 billion parameters and is pretrained on 12 public gait datasets comprising over 2 million walking sequences. Extensive experiments demonstrate that FoundationGait, with or without fine-tuning, performs robustly across a wide spectrum of gait datasets, conditions, tasks (e.g., human identification, scoliosis screening, depression prediction, and attribute estimation), and even input modalities. Notably, it achieves 48.0% zero-shot rank-1 accuracy on the challenging in-the-wild Gait3D dataset (1,000 test subjects) and 64.5% on the largest in-the-lab OU-MVLP dataset (5,000+ test subjects), setting a new milestone in robust gait recognition. Code and model (forthcoming): this https URL.
https://arxiv.org/abs/2512.00691
Appearance-based gait recognition has achieved strong performance on controlled datasets, yet systematic evaluation of its robustness to real-world corruptions and silhouette variability remains lacking. We present RobustGait, a framework for fine-grained robustness evaluation of appearance-based gait recognition systems. The RobustGait evaluation spans four dimensions: the type of perturbation (digital, environmental, temporal, occlusion), the silhouette extraction method (segmentation and parsing networks), the architectural capacity of gait recognition models, and various deployment scenarios. The benchmark introduces 15 corruption types at 5 severity levels across CASIA-B, CCPG, and SUSTech1K, with in-the-wild validation on MEVID, and evaluates six state-of-the-art gait systems. The evaluation yields several insights. First, applying noise at the RGB level better reflects real-world degradation and reveals how distortions propagate through silhouette extraction to the downstream gait recognition systems. Second, gait accuracy is highly sensitive to silhouette-extractor biases, revealing an overlooked source of benchmark bias. Third, robustness depends on both the type of perturbation and the architectural design. Finally, we explore robustness-enhancing strategies, showing that noise-aware training and knowledge distillation improve performance and move toward deployment-ready systems.
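The idea of injecting corruption at the RGB level and watching it propagate through silhouette extraction can be sketched in a few lines. Everything here is illustrative: the severity-to-sigma mapping is assumed, and the thresholding "extractor" is a toy stand-in for the segmentation and parsing networks the benchmark actually evaluates:

```python
import random

# Assumed mapping from severity level to Gaussian noise scale.
SEVERITY_SIGMA = {1: 0.02, 2: 0.05, 3: 0.10, 4: 0.15, 5: 0.25}

def corrupt_rgb(frame, severity, rng):
    """Additive Gaussian noise applied at the RGB level, clamped to [0, 1]."""
    sigma = SEVERITY_SIGMA[severity]
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in frame]

def extract_silhouette(frame, threshold=0.5):
    """Toy stand-in for a segmentation network: intensity thresholding."""
    return [[1 if px > threshold else 0 for px in row] for row in frame]

rng = random.Random(0)
# Bright 4-pixel-wide "subject" on a dark background.
clean = [[0.9 if 2 <= x <= 5 else 0.1 for x in range(8)] for _ in range(8)]
sil_clean = extract_silhouette(clean)
sil_noisy = extract_silhouette(corrupt_rgb(clean, severity=5, rng=rng))
# Pixels that differ between the two silhouettes measure how RGB-level
# corruption propagates into the gait input.
flipped = sum(a != b for ra, rb in zip(sil_clean, sil_noisy)
              for a, b in zip(ra, rb))
```

Comparing `flipped` across severity levels and extractors is the kind of fine-grained measurement the benchmark performs at scale.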
https://arxiv.org/abs/2511.13065
Gait recognition offers a non-intrusive biometric solution by identifying individuals through their walking patterns. Although discriminative models have achieved notable success in this domain, the full potential of generative models remains largely underexplored. In this paper, we introduce \textbf{CoD$^2$}, a novel framework that combines the data distribution modeling capabilities of diffusion models with the semantic representation learning strengths of discriminative models to extract robust gait features. We propose a Multi-level Conditional Control strategy that incorporates both high-level identity-aware semantic conditions and low-level visual details. Specifically, the high-level condition, extracted by the discriminative extractor, guides the generation of identity-consistent gait sequences, whereas low-level visual details, such as appearance and motion, are preserved to enhance consistency. Furthermore, the generated sequences facilitate the discriminative extractor's learning, enabling it to capture more comprehensive high-level semantic features. Extensive experiments on four datasets (SUSTech1K, CCPG, GREW, and Gait3D) demonstrate that CoD$^2$ achieves state-of-the-art performance and can be seamlessly integrated with existing discriminative methods, yielding consistent improvements.
https://arxiv.org/abs/2511.06245
Deep learning-based gait recognition has achieved great success in various applications. The key to accurate gait recognition lies in considering the unique and diverse behavior patterns in different motion regions, especially when covariates affect visual appearance. However, existing methods typically use predefined regions for temporal modeling, with fixed or equivalent temporal scales assigned to different types of regions, which makes it difficult to model motion regions that change dynamically over time and adapt to their specific patterns. To tackle this problem, we introduce a Region-aware Dynamic Aggregation and Excitation framework (GaitRDAE) that automatically searches for motion regions, assigns adaptive temporal scales and applies corresponding attention. Specifically, the framework includes two core modules: the Region-aware Dynamic Aggregation (RDA) module, which dynamically searches the optimal temporal receptive field for each region, and the Region-aware Dynamic Excitation (RDE) module, which emphasizes the learning of motion regions containing more stable behavior patterns while suppressing attention to static regions that are more susceptible to covariates. Experimental results show that GaitRDAE achieves state-of-the-art performance on several benchmark datasets.
https://arxiv.org/abs/2510.16541
Gait recognition is an important biometric for human identification at a distance, particularly under low-resolution or unconstrained environments. Current works typically focus on either 2D representations (e.g., silhouettes and skeletons) or 3D representations (e.g., meshes and SMPLs), but relying on a single modality often fails to capture the full geometric and dynamic complexity of human walking patterns. In this paper, we propose a multi-modal and multi-task framework that combines 2D temporal silhouettes with 3D SMPL features for robust gait analysis. Beyond identification, we introduce a multitask learning strategy that jointly performs gait recognition and human attribute estimation, including age, body mass index (BMI), and gender. A unified transformer is employed to effectively fuse multi-modal gait features and better learn attribute-related representations, while preserving discriminative identity cues. Extensive experiments on the large-scale BRIAR datasets, collected under challenging conditions such as long-range distances (up to 1 km) and extreme pitch angles (up to 50°), demonstrate that our approach outperforms state-of-the-art methods in gait recognition and provides accurate human attribute estimation. These results highlight the promise of multi-modal and multitask learning for advancing gait-based human understanding in real-world scenarios.
https://arxiv.org/abs/2510.10417
Gait recognition, a fundamental biometric technology, leverages unique walking patterns for individual identification, typically using 2D representations such as silhouettes or skeletons. However, these methods often struggle with viewpoint variations, occlusions, and noise. Multi-modal approaches that incorporate 3D body shape information offer improved robustness but are computationally expensive, limiting their feasibility for real-time applications. To address these challenges, we introduce Mesh-Gait, a novel end-to-end multi-modal gait recognition framework that directly reconstructs 3D representations from 2D silhouettes, effectively combining the strengths of both modalities. In existing methods, learning 3D features directly from 3D joints or meshes is complex, and the resulting features are difficult to fuse with silhouette-based gait features. To overcome this, Mesh-Gait reconstructs 3D heatmaps as an intermediate representation, enabling the model to effectively capture 3D geometric information while maintaining simplicity and computational efficiency. During training, the intermediate 3D heatmaps are gradually reconstructed and become increasingly accurate under supervised learning, where the loss is calculated between the reconstructed 3D joints, virtual markers, and 3D meshes and their corresponding ground truth, ensuring precise spatial alignment and consistent 3D structure. Mesh-Gait extracts discriminative features from both silhouettes and reconstructed 3D heatmaps in a computationally efficient manner. This design enables the model to capture spatial and structural gait characteristics while avoiding the heavy overhead of direct 3D reconstruction from RGB videos, allowing the network to focus on motion dynamics rather than irrelevant visual details. Extensive experiments demonstrate that Mesh-Gait achieves state-of-the-art accuracy. The code will be released upon acceptance of the paper.
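A 3D heatmap intermediate representation is typically built by rendering each joint as a Gaussian blob on a voxel grid. The sketch below shows that construction for a single joint; the grid size and sigma are arbitrary choices for illustration, not Mesh-Gait's configuration:

```python
import math

def joint_to_heatmap(joint, size=8, sigma=1.0):
    """Render one 3-D joint as a Gaussian blob on a size^3 voxel grid.

    joint: (x, y, z) in voxel coordinates. Returns vol[z][y][x]. A volume
    like this is the kind of intermediate representation that can be
    supervised against ground-truth joint positions.
    """
    jx, jy, jz = joint
    return [[[math.exp(-((x - jx) ** 2 + (y - jy) ** 2 + (z - jz) ** 2)
                       / (2 * sigma ** 2))
              for x in range(size)]
             for y in range(size)]
            for z in range(size)]

vol = joint_to_heatmap((3, 4, 2))
# The heatmap peaks at the joint's voxel and decays smoothly around it,
# giving a dense, differentiable target rather than sparse coordinates.
```

Summing such volumes over all joints (and virtual markers) yields a dense 3D geometric signal that is straightforward to fuse with 2D silhouette features.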
https://arxiv.org/abs/2510.10406
Gait recognition is a valuable biometric task that enables the identification of individuals from a distance based on their walking patterns. However, it remains limited by the lack of large-scale labeled datasets and the difficulty of collecting diverse gait samples for each individual while preserving privacy. To address these challenges, we propose GaitCrafter, a diffusion-based framework for synthesizing realistic gait sequences in the silhouette domain. Unlike prior works that rely on simulated environments or alternative generative models, GaitCrafter trains a video diffusion model from scratch, exclusively on gait silhouette data. Our approach enables the generation of temporally consistent and identity-preserving gait sequences. Moreover, the generation process is controllable, allowing conditioning on various covariates such as clothing, carried objects, and view angle. We show that incorporating synthetic samples generated by GaitCrafter into the gait recognition pipeline leads to improved performance, especially under challenging conditions. Additionally, we introduce a mechanism to generate novel identities (synthetic individuals not present in the original dataset) by interpolating identity embeddings. These novel identities exhibit unique, consistent gait patterns and are useful for training models while maintaining privacy of real subjects. Overall, our work takes an important step toward leveraging diffusion models for high-quality, controllable, and privacy-aware gait data generation.
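Interpolating identity embeddings to mint novel identities is simple to sketch. This is a minimal linear interpolation with renormalization; the exact interpolation scheme used by GaitCrafter is not specified here, so treat this as one plausible instantiation:

```python
def interpolate_identity(emb_a, emb_b, alpha=0.5):
    """Linear interpolation between two identity embeddings, renormalized.

    Intermediate points act as synthetic identities that exist nowhere in
    the training set, which is useful for privacy-preserving training.
    """
    mixed = [(1 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]
    norm = sum(v * v for v in mixed) ** 0.5
    return [v / norm for v in mixed]

# Two (toy) unit-norm identity embeddings and their midpoint identity.
id_a = [1.0, 0.0, 0.0]
id_b = [0.0, 1.0, 0.0]
novel = interpolate_identity(id_a, id_b, alpha=0.5)
```

Conditioning the diffusion model on `novel` would then produce a consistent gait sequence for a subject who does not exist.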
https://arxiv.org/abs/2508.13300
Recent advancements in gait recognition have significantly enhanced performance by treating silhouettes as either an unordered set or an ordered sequence. However, both set-based and sequence-based approaches exhibit notable limitations. Specifically, set-based methods tend to overlook short-range temporal context for individual frames, while sequence-based methods struggle to capture long-range temporal dependencies effectively. To address these challenges, we draw inspiration from human identification and propose a new perspective that conceptualizes human gait as a composition of individualized actions. Each action is represented by a series of frames, randomly selected from a continuous segment of the sequence, which we term a snippet. Fundamentally, the collection of snippets for a given sequence enables the incorporation of multi-scale temporal context, facilitating more comprehensive gait feature learning. Moreover, we introduce a non-trivial solution for snippet-based gait recognition, focusing on Snippet Sampling and Snippet Modeling as key components. Extensive experiments on four widely-used gait datasets validate the effectiveness of our proposed approach and, more importantly, highlight the potential of gait snippets. For instance, our method achieves the rank-1 accuracy of 77.5% on Gait3D and 81.7% on GREW using a 2D convolution-based backbone.
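The snippet definition, random frames drawn from a continuous segment of the sequence, can be sketched directly. The parameter names and values below are illustrative; the paper's Snippet Sampling component is more elaborate:

```python
import random

def sample_snippets(seq_len, num_snippets, segment_len, frames_per_snippet, rng):
    """Sample snippets: random frame indices from continuous segments.

    Each snippet draws `frames_per_snippet` indices (sorted, without
    replacement) from a randomly placed window of `segment_len` frames.
    The collection of snippets mixes short-range context (within a
    window) with long-range context (windows spread over the sequence).
    """
    snippets = []
    for _ in range(num_snippets):
        start = rng.randrange(seq_len - segment_len + 1)
        window = range(start, start + segment_len)
        snippets.append(sorted(rng.sample(window, frames_per_snippet)))
    return snippets

rng = random.Random(0)
snips = sample_snippets(seq_len=60, num_snippets=4, segment_len=16,
                        frames_per_snippet=4, rng=rng)
```

Each snippet then represents one "individualized action" whose frames feed the Snippet Modeling stage.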
https://arxiv.org/abs/2508.07782
Distributed Acoustic Sensing (DAS) technology finds growing applications across various domains. However, data distribution disparities due to heterogeneous sensing environments pose challenges for data-driven artificial intelligence (AI) models, limiting cross-domain generalization and facing a shortage of labeled training data. To address these issues, this study proposes a foundational model for DAS signal recognition based on a Masked Autoencoder, named MAEPD. The MAEPD model is pretrained on a dataset of 635,860 samples, encompassing DAS gait spatiotemporal signals, 2D GASF images for perimeter security, 2D time-frequency images for pipeline leakage, and open-dataset signals including whale vocalizations and seismic activities, using a self-supervised mask reconstruction task to capture deep semantic features of DAS signals. Visual Prompt Tuning (VPT) is employed for downstream recognition tasks. This method freezes the pretrained backbone parameters and fine-tunes only a small set of learnable visual prompt vectors inserted into the Transformer encoder layers. Experiments on the NVIDIA GeForce RTX 4080 Super platform validate MAEPD using indoor gait recognition as a downstream task. The VPT-Deep approach achieves a classification accuracy of 96.94% with just 0.322% of parameters fine-tuned, surpassing the traditional Full Fine Tuning (FFT) method by 0.61% and reducing training time by 45%. The model also exhibits robust performance in pipeline leakage detection, confirming the generality, efficiency, and scalability of MAEPD as a foundational model. This approach offers a novel paradigm for addressing the limited generalization of signal recognition models in the DAS domain.
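The economics of VPT-Deep, freezing the backbone and training only prompt vectors inserted at each encoder layer, can be made concrete with a parameter count. The ViT-Base-like numbers below are assumptions for illustration, not the MAEPD configuration (whose trained fraction the abstract reports as 0.322%):

```python
def vpt_parameter_fraction(backbone_params, num_layers, num_prompts, dim):
    """Fraction of weights updated under deep Visual Prompt Tuning.

    The pretrained backbone stays frozen; only `num_prompts` learnable
    prompt vectors of width `dim`, inserted at each of `num_layers`
    Transformer encoder layers, receive gradients (VPT-Deep).
    """
    prompt_params = num_layers * num_prompts * dim
    return prompt_params / (backbone_params + prompt_params)

# e.g. an ~86M-parameter encoder, 12 layers, 24 prompts/layer, width 768:
frac = vpt_parameter_fraction(86_000_000, num_layers=12,
                              num_prompts=24, dim=768)
```

Even with generous prompt counts, the trainable fraction stays well under 1%, which is why fine-tuning is both cheap and fast relative to full fine-tuning.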
https://arxiv.org/abs/2508.04316
Robust gait recognition requires highly discriminative representations, which are closely tied to input modalities. While binary silhouettes and skeletons have dominated recent literature, these 2D representations fall short of capturing sufficient cues to handle viewpoint variations or to preserve the finer, meaningful details of gait. In this paper, we introduce a novel framework, termed DepthGait, that incorporates RGB-derived depth maps and silhouettes for enhanced gait recognition. Specifically, apart from the 2D silhouette representation of the human body, the proposed pipeline explicitly estimates depth maps from a given RGB image sequence and uses them as a new modality to capture discriminative features inherent in human locomotion. In addition, a novel multi-scale and cross-level fusion scheme has also been developed to bridge the modality gap between depth maps and silhouettes. Extensive experiments on standard benchmarks demonstrate that the proposed DepthGait achieves state-of-the-art performance compared to peer methods and attains strong mean rank-1 accuracy on challenging datasets.
https://arxiv.org/abs/2508.03397
Current gait recognition methodologies generally necessitate retraining when encountering new datasets. Nevertheless, retrained models frequently encounter difficulties in preserving knowledge from previous datasets, leading to a significant decline in performance on earlier test sets. To tackle these challenges, we present a continual gait recognition task, termed GaitAdapt, which supports the progressive enhancement of gait recognition capabilities over time and is systematically categorized according to various evaluation scenarios. Additionally, we propose GaitAdapter, a non-replay continual learning approach for gait recognition. This approach integrates the GaitPartition Adaptive Knowledge (GPAK) module, employing graph neural networks to aggregate common gait patterns from current data into a repository constructed from graph vectors. Subsequently, this repository is used to improve the discriminability of gait features in new tasks, thereby enhancing the model's ability to effectively recognize gait patterns. We also introduce a Euclidean Distance Stability method based on Negative pairs (EDSN), which ensures that newly added gait samples from different classes maintain similar relative spatial distributions across both previous and current gait tasks, thereby alleviating the impact of task changes on the distinguishability of original-domain features. Extensive evaluations demonstrate that GaitAdapter effectively retains gait knowledge acquired from diverse tasks, exhibiting markedly superior discriminative capability compared to alternative methods.
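A distance-stability penalty in the spirit of EDSN — keeping the relative Euclidean layout of different-class (negative-pair) embeddings similar before and after adapting to a new task — can be sketched as follows. The shapes, the mask construction, and the mean-absolute-change penalty are assumptions, not the paper's exact formulation.

```python
import numpy as np

# Hedged sketch of a negative-pair distance-stability penalty (EDSN-style idea):
# embeddings of different-class samples should keep a similar relative spatial
# layout across the old and the adapted model. Exact penalty form is assumed.

def pairwise_dist(X):
    """Euclidean distance matrix for row-vector embeddings X: (n, d)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))   # clamp tiny negatives from rounding

def distance_stability(old_emb, new_emb, neg_mask):
    """Mean absolute change in distance over negative (different-class) pairs."""
    diff = np.abs(pairwise_dist(old_emb) - pairwise_dist(new_emb))
    return diff[neg_mask].mean()

rng = np.random.default_rng(2)
labels = np.array([0, 0, 1, 1, 2])
neg = labels[:, None] != labels[None, :]          # True where classes differ
old = rng.standard_normal((5, 16))

print(distance_stability(old, old.copy(), neg))   # unchanged embeddings -> 0.0
print(distance_stability(old, old + 0.5 * rng.standard_normal((5, 16)), neg))
```

Minimizing such a term alongside the new-task loss discourages the adapted model from rearranging old inter-class geometry, which is one plain way to read the abstract's "similar relative spatial distributions" requirement.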
https://arxiv.org/abs/2508.03375
Gait recognition aims to identify individuals based on their body shape and walking patterns. Though much progress has been achieved driven by deep learning, gait recognition in real-world surveillance scenarios remains quite challenging for current methods. Conventional approaches, which rely on periodic gait cycles and controlled environments, struggle with the non-periodic and occluded silhouette sequences encountered in the wild. In this paper, we propose a novel framework, TrackletGait, designed to address these challenges in the wild. We propose Random Tracklet Sampling, a generalization of existing sampling methods, which strikes a balance between robustness and representation in capturing diverse walking patterns. Next, we introduce Haar Wavelet-based Downsampling to preserve information during spatial downsampling. Finally, we present a Hardness Exclusion Triplet Loss, designed to exclude low-quality silhouettes by discarding hard triplet samples. TrackletGait achieves state-of-the-art results, with 77.8% and 80.4% rank-1 accuracy on the Gait3D and GREW datasets, respectively, while using only 10.3M backbone parameters. Extensive experiments are also conducted to further investigate the factors affecting gait recognition in the wild.
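The information-preserving downsampling step can be illustrated with a plain 2x2 Haar transform: instead of discarding pixels as strided pooling does, the four Haar subbands are stacked on the channel axis, so the spatial reduction is invertible. The orthonormal scaling below is the textbook convention, not necessarily the paper's exact layer.

```python
import numpy as np

# Hedged sketch of Haar-wavelet downsampling: replace stride-2 pooling with a
# 2x2 Haar transform whose four subbands are stacked on the channel axis, so
# the spatial reduction is lossless (the transform is invertible).

def haar_downsample(x):
    """x: (C, H, W) with even H, W  ->  (4C, H//2, W//2)."""
    a = x[:, 0::2, 0::2]; b = x[:, 0::2, 1::2]   # the four pixels of
    c = x[:, 1::2, 0::2]; d = x[:, 1::2, 1::2]   # each 2x2 block
    ll = (a + b + c + d) / 2.0   # low-low: local average (scaled)
    lh = (a - b + c - d) / 2.0   # horizontal detail
    hl = (a + b - c - d) / 2.0   # vertical detail
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return np.concatenate([ll, lh, hl, hh], axis=0)

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
y = haar_downsample(x)
print(y.shape)                                       # (8, 2, 2)
# The orthonormal Haar basis preserves energy, so nothing is discarded:
print(np.allclose((x ** 2).sum(), (y ** 2).sum()))   # True
```

A subsequent convolution can then learn which subbands matter, rather than having the high-frequency content (often the discriminative silhouette edges) destroyed by pooling.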
https://arxiv.org/abs/2508.02143