Gait recognition, as a promising biometric technology, identifies individuals through their unique walking patterns and offers distinctive advantages including non-invasiveness, long-range applicability, and resistance to deliberate disguise. Despite these merits, capturing the intrinsic motion patterns concealed within consecutive video frames remains challenging due to the complexity of video data and the interference of external covariates such as viewpoint changes, clothing variations, and carrying conditions. Existing approaches predominantly rely either on static appearance features extracted from individual silhouette frames or on complex sequential models (e.g., LSTM, 3D convolutions) that demand substantial computational resources and sophisticated training strategies. To address these limitations, we propose a Local Spatiotemporal Convolutional Network (LSTCN), a structurally simple yet highly effective dual-branch architecture that endows standard two-dimensional convolutional networks with the capacity to extract temporal information. Specifically, we introduce a Global Bidirectional Spatial Pooling (GBSP) mechanism that reduces the dimensionality of gait tensors by decomposing spatial features into horizontal and vertical strip-based local representations, enabling the temporal dimension to participate in standard 2D convolution operations. Building upon this, we design a Local Spatiotemporal Convolutional (LSTC) layer that jointly processes temporal and spatial dimensions, allowing the network to adaptively learn strip-based gait motion patterns. We further extend this formulation with asymmetric convolution kernels that independently attend to the temporal, spatial, and joint spatiotemporal domains, thereby enriching the extracted feature representations.
https://arxiv.org/abs/2605.14548
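The strip-based pooling idea in the abstract above can be illustrated with a minimal sketch. It assumes mean pooling and a plain nested-list tensor of shape T x H x W; the function name `bidirectional_strip_pool` and the pooling operator are illustrative assumptions, not the authors' GBSP implementation. Collapsing one spatial axis at a time yields two 2D maps (time x height and time x width), which is what lets an ordinary 2D convolution slide over the temporal axis.

```python
def bidirectional_strip_pool(seq):
    """Pool a T x H x W sequence into a T x H map and a T x W map.

    Mean pooling is an assumption made for this sketch.
    """
    # Horizontal strips: average each row over the width axis -> T x H.
    horizontal = [[sum(row) / len(row) for row in frame] for frame in seq]
    # Vertical strips: average each column over the height axis -> T x W.
    vertical = [[sum(col) / len(col) for col in zip(*frame)] for frame in seq]
    return horizontal, vertical


# Two toy 3 x 4 binary silhouette frames (T=2, H=3, W=4).
frames = [
    [[0, 1, 1, 0],
     [0, 1, 1, 0],
     [1, 1, 1, 1]],
    [[0, 0, 1, 1],
     [0, 1, 1, 0],
     [1, 1, 0, 0]],
]
h_map, v_map = bidirectional_strip_pool(frames)
```

Each returned map has time as its first axis, so a 2D kernel applied to `h_map` or `v_map` mixes information across frames and strips at once.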
Conventional gait de-identification methods often encounter an inherent trade-off: they either provide insufficient identity suppression or introduce spatiotemporal distortions that impede structure-sensitive downstream applications. We propose GaitProtector, an impersonation-driven gait de-identification framework that formulates privacy protection as a unified objective with two tightly coupled components: (i) obfuscation, which repels the protected gait from the source identity, and (ii) impersonation, which attracts it toward a selected target identity. The target identity serves as a semantic anchor that biases optimization toward structurally plausible gait patterns under the pretrained diffusion prior, helping preserve dominant body shape and motion dynamics. We instantiate this idea through a training-free diffusion latent optimization pipeline. Instead of retraining a generator for each dataset, we invert each input silhouette sequence into the latent trajectory of a pretrained 3D video diffusion model and iteratively optimize latent codes with a differentiable adversarial objective to synthesize protected gaits. Experiments on the CASIA-B dataset show that GaitProtector achieves a 56.7% impersonation success rate under black-box gait recognition and reduces Rank-1 identification accuracy from 89.6% to 15.0%, while maintaining favorable visual and temporal quality. We further evaluate downstream utility on the Scoliosis1K dataset, where diagnostic accuracy decreases only from 91.4% to 74.2%. To the best of our knowledge, this work is the first to leverage pretrained 3D diffusion priors in a training-free manner for silhouette-based gait de-identification.
https://arxiv.org/abs/2605.12431
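The two-part objective described above (repel from the source identity, attract toward the target identity) can be sketched on plain embedding vectors. This is a hypothetical illustration: the cosine-similarity form, the `weight` parameter, and the function names are assumptions, and the abstract does not specify the actual loss used by GaitProtector.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def protection_loss(protected, source, target, weight=1.0):
    """Hypothetical combined objective: lower is better."""
    # Obfuscation term: similarity to the source identity is penalised.
    obfuscation = cosine(protected, source)
    # Impersonation term: distance from the target identity is penalised.
    impersonation = 1.0 - cosine(protected, target)
    return obfuscation + weight * impersonation


source = [1.0, 0.0]
target = [0.0, 1.0]
near_source = protection_loss([0.9, 0.1], source, target)
near_target = protection_loss([0.1, 0.9], source, target)
```

An embedding close to the target scores a lower loss than one close to the source, which is the direction an optimizer over latent codes would be pushed in under this assumed formulation.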
Gait recognition has emerged as a compelling biometric modality for surveillance and security applications, offering inherent advantages such as non-intrusiveness, resistance to disguise, and long-range identification capability. However, prevailing approaches struggle to comprehensively capture and exploit the rich biometric cues embedded in human locomotion, particularly under covariate interference including viewpoint variation, clothing change, and carrying conditions. In this paper, we present a high-precision gait recognition framework that deeply extracts and synergistically fuses gait dynamics with body shape characteristics through a multi-branch architecture grounded in deep residual learning. Specifically, we first employ the High-Resolution Network (HRNet) to perform robust skeletal keypoint estimation, preserving fine-grained spatial information even under low-resolution inputs. We then construct three complementary feature branches -- body proportion, gait velocity, and skeletal motion -- from the extracted pose sequences. A 50-layer Residual Network (ResNet-50) backbone is leveraged within a deep feature extraction module to capture hierarchically rich and discriminative representations. To effectively integrate heterogeneous feature streams, we design a Multi-Branch Feature Fusion (MFF) module inspired by channel-wise attention mechanisms, which dynamically allocates contribution weights across branches through learned activation parameters. Extensive experiments on the cross-view multi-condition CASIA-B benchmark demonstrate that our method achieves a Rank-1 accuracy of 94.52% under normal walking, and the best recognition performance among skeleton-based methods under the coat-wearing condition.
https://arxiv.org/abs/2604.27353
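The weighted fusion of the three branches described above can be sketched as a softmax over per-branch scores followed by a weighted sum. This is an assumed, simplified form: in the paper the scores would come from learned parameters, whereas here `branch_scores` is supplied by hand, and the name `fuse_branches` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(features, branch_scores):
    """Weighted sum of equal-length branch features.

    branch_scores stands in for learned parameters (an assumption).
    """
    weights = softmax(branch_scores)
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


proportion = [1.0, 0.0, 0.0]  # toy body-proportion feature
velocity = [0.0, 1.0, 0.0]    # toy gait-velocity feature
motion = [0.0, 0.0, 1.0]      # toy skeletal-motion feature
fused, weights = fuse_branches([proportion, velocity, motion], [0.0, 0.0, 0.0])
skewed, skewed_weights = fuse_branches([proportion, velocity, motion], [2.0, 0.0, 0.0])
```

With equal scores each branch contributes one third; raising one score shifts weight toward that branch while the weights still sum to one.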
Gait recognition is an attractive biometric modality for long-range and contact-free identification, but high-performing gait models often rely on deep and computationally expensive architectures that are difficult to deploy in practice. Knowledge distillation (KD) offers a natural way to transfer knowledge from a powerful teacher to an efficient student; however, standard KD is often less effective for part-structured gait models, where supervision is formed from both part-wise classification logits and part-wise retrieval embeddings. In this paper, we propose GaitKD, a distillation framework that decouples gait knowledge transfer into two complementary components: decision-level distillation and boundary-level distillation. Specifically, GaitKD aligns the teacher and student through part-calibrated logit distillation to transfer inter-class decision relations, while preserving the teacher-induced partitioning of the embedding space through an activation-boundary objective instead of direct feature regression. With a simple aligned part-wise design, GaitKD supports heterogeneous teacher-student gait models without introducing additional inference cost. Experimental results across multiple gait recognition benchmarks and teacher-student configurations show consistent improvements over strong gait baselines. Our study demonstrates that the two transfer components are complementary, and boundary-preserving distillation provides more stable performance than direct feature regression. Source code is available at this https URL
https://arxiv.org/abs/2604.26255
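For readers unfamiliar with logit distillation on part-structured models, the following sketch applies standard temperature-scaled knowledge distillation independently to each body part and averages the result. It is the generic formulation, not the part-calibrated variant proposed in GaitKD, and the function names and default temperature are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def partwise_kd_loss(teacher_parts, student_parts, temperature=4.0):
    """Average the distillation loss over aligned body parts."""
    losses = [kd_loss(t, s, temperature) for t, s in zip(teacher_parts, student_parts)]
    return sum(losses) / len(losses)


# Two parts, three identity classes each (toy values).
teacher = [[2.0, 0.5, -1.0], [1.5, 1.0, 0.0]]
student = [[1.0, 1.0, 0.0], [0.5, 2.0, 0.0]]
loss_same = partwise_kd_loss(teacher, teacher)
loss_diff = partwise_kd_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's logits and positive otherwise; because it only touches training, it adds no cost at inference time.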
Gait recognition has emerged as a powerful biometric technique for identifying individuals at a distance without requiring user cooperation. Most existing methods focus primarily on RGB-derived modalities, which fall short in real-world scenarios requiring multi-modal collaboration and cross-modal retrieval. To overcome these challenges, we present MMGait, a comprehensive multi-modal gait benchmark integrating data from five heterogeneous sensors, including an RGB camera, a depth camera, an infrared camera, a LiDAR scanner, and a 4D Radar system. MMGait contains twelve modalities and 334,060 sequences from 725 subjects, enabling systematic exploration across geometric, photometric, and motion domains. Based on MMGait, we conduct extensive evaluations on single-modal, cross-modal, and multi-modal paradigms to analyze modality robustness and complementarity. Furthermore, we introduce a new task, Omni Multi-Modal Gait Recognition, which aims to unify the above three gait recognition paradigms within a single model. We also propose a simple yet powerful baseline, OmniGait, which learns a shared embedding space across diverse modalities and achieves promising recognition performance. The MMGait benchmark, codebase, and pretrained checkpoints are publicly available at this https URL.
https://arxiv.org/abs/2604.15979
Gait recognition, a reliable biometric technology, has developed rapidly in recent years, yet it still faces significant challenges caused by the diverse clothing styles found in the real world. This paper introduces BarbieGait, a synthetic gait dataset where real-world subjects are uniquely mapped into a virtual engine to simulate extensive clothing changes while preserving their gait identity information. As a pioneering work, BarbieGait provides a controllable gait data generation method, enabling the production of large datasets to validate cross-clothing issues that are difficult to verify with real-world data. However, clothing diversity increases intra-class variance, which makes learning cloth-invariant features under varying clothing conditions one of the biggest challenges. Therefore, we propose GaitCLIF (Gait-oriented CLoth-Invariant Feature) as a robust baseline model for cross-clothing gait recognition. Through extensive experiments, we validate that our method significantly improves cross-clothing performance on BarbieGait and on existing popular gait benchmarks. We believe that BarbieGait, with its extensive cross-clothing gait data, will further advance the capabilities of gait recognition in cross-clothing scenarios and promote progress in related research.
https://arxiv.org/abs/2604.12221
Gait recognition is a biometric modality that identifies individuals from their characteristic walking patterns. Unlike conventional biometric traits, gait can be acquired at a distance and without active subject cooperation, making it suitable for surveillance and public safety applications. Nevertheless, silhouette-based temporal models remain sensitive to long sequences, observation noise, and appearance-related covariates. Recurrent architectures often struggle to preserve information from earlier frames and are inherently sequential to optimize, whereas transformer-based models typically require greater computational resources and larger training sets and may be sensitive to irregular sequence lengths and noisy inputs. These limitations reduce robustness under clothing variation, carrying conditions, and view changes, while also hindering the joint modeling of local gait cycles and longer-term motion trends. To address these challenges, we introduce a Temporal Kolmogorov-Arnold Network (TKAN) for gait recognition. The proposed model replaces fixed edge weights with learnable one-dimensional functions and incorporates a two-level memory mechanism consisting of short-term RKAN sublayers and a gated long-term pathway. This design enables efficient modeling of both cycle-level dynamics and broader temporal context while maintaining a compact backbone. Experiments on the CASIA-B dataset indicate that the proposed CNN+TKAN framework achieves strong recognition performance under the reported evaluation setting.
https://arxiv.org/abs/2604.09990
Skeleton-based gait recognizers excel at modeling spatial configurations but often underuse explicit motion dynamics that are crucial under appearance changes. We introduce a plug-and-play Wavelet Feature Stream that augments any skeleton backbone with time-frequency dynamics of joint velocities. Concretely, per-joint velocity sequences are transformed by the continuous wavelet transform (CWT) into multi-scale scalograms, from which a lightweight multi-scale CNN learns discriminative dynamic cues. The resulting descriptor is fused with the backbone representation for classification, requiring no changes to the backbone architecture or additional supervision. Across CASIA-B, the proposed stream delivers consistent gains on strong skeleton backbones (e.g., GaitMixer, GaitFormer, GaitGraph) and establishes a new skeleton-based state of the art when attached to GaitMixer. The improvements are especially pronounced under covariate shifts such as carrying bags (BG) and wearing coats (CL), highlighting the complementarity of explicit time-frequency modeling and standard spatio-temporal encoders.
https://arxiv.org/abs/2604.03002
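The pipeline described above (joint positions, then velocities, then a multi-scale scalogram) can be sketched without external libraries. The Ricker (Mexican-hat) wavelet, the chosen scales, and the zero-padded correlation are assumptions made for this illustration; the abstract does not state which mother wavelet or scale set the authors use.

```python
import math


def ricker_kernel(scale):
    """Ricker (Mexican-hat) wavelet sampled on an odd-length integer grid."""
    half = int(4 * scale)
    norm = 2 / (math.sqrt(3 * scale) * math.pi ** 0.25)
    return [
        norm * (1 - (t / scale) ** 2) * math.exp(-t * t / (2 * scale ** 2))
        for t in range(-half, half + 1)
    ]


def cwt(signal, scales):
    """Return a len(scales) x len(signal) scalogram (zero-padded borders)."""
    n = len(signal)
    scalogram = []
    for s in scales:
        kernel = ricker_kernel(s)
        half = len(kernel) // 2
        row = []
        for i in range(n):
            acc = 0.0
            for k, w in enumerate(kernel):
                j = i + k - half
                if 0 <= j < n:
                    acc += w * signal[j]
            row.append(acc)
        scalogram.append(row)
    return scalogram


# Toy trajectory of one joint coordinate: a 20-frame periodic swing.
positions = [math.sin(2 * math.pi * t / 20) for t in range(60)]
# Velocity as a first-order finite difference between consecutive frames.
velocity = [b - a for a, b in zip(positions, positions[1:])]
scalogram = cwt(velocity, scales=[1, 2, 4, 8])
```

Each row of `scalogram` is the response at one scale, so stacking rows gives the time-frequency image that a small CNN could consume alongside the backbone features.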
Frailty is a condition in aging medicine characterized by diminished physiological reserve and increased vulnerability to stressors. However, frailty assessment remains subjective, heterogeneous, and difficult to scale in clinical practice. Gait is a sensitive marker of biological aging, capturing multisystem decline before overt disability. Yet the application of modern computer vision to gait-based frailty assessment has been limited by small, imbalanced datasets and a lack of clinically representative benchmarks. In this work, we introduce a publicly available silhouette-based frailty gait dataset collected in a clinically realistic setting, spanning the full frailty spectrum and including older adults who use walking aids. Using this dataset, we evaluate how pretrained gait recognition models can be adapted for frailty classification under limited data conditions. We study both convolutional and hybrid attention-based architectures and show that predictive performance depends primarily on how pretrained representations are transferred rather than architectural complexity alone. Across models, selectively freezing low-level gait representations while allowing higher-level features to adapt yields more stable and generalizable performance than either full fine-tuning or rigid freezing. Conservative handling of class imbalance further improves training stability, and combining complementary learning objectives enhances discrimination between clinically adjacent frailty states. Interpretability analyses reveal consistent model attention to lower-limb and pelvic regions, aligning with established biomechanical correlates of frailty. Together, these findings establish gait-based representation learning as a scalable, non-invasive, and interpretable framework for frailty assessment and support the integration of modern biometric modeling approaches into aging research and clinical practice.
https://arxiv.org/abs/2603.24434
Gait silhouettes, which can be encoded into binary gait codes, are widely adopted to represent pedestrian motion patterns. Recent approaches commonly leverage visual backbones to encode gait silhouettes, achieving strong performance. However, they primarily focus on continuous visual features, overlooking the discrete nature of binary silhouettes that inherently share a discrete encoding space with natural language. Large Language Models (LLMs) have demonstrated exceptional capability in extracting discriminative features from discrete sequences and modeling long-range dependencies, highlighting their potential to capture temporal motion patterns by identifying subtle variations. Motivated by these observations, we explore bridging binary gait silhouettes and natural language within a binary encoding space. However, the encoding spaces of text tokens and binary gait silhouettes remain misaligned, primarily due to differences in token frequency and density. To address this issue, we propose the Contour-Velocity Tokenizer, which encodes binary gait silhouettes while reshaping their distribution to better align with the text token space. We then establish a dual-branch framework termed Silhouette Language Model (SilLang), which enhances visual silhouettes by integrating discrete linguistic embeddings derived from LLMs. Implemented on mainstream gait backbones, SilLang consistently improves state-of-the-art methods across SUSTech1K, GREW, and Gait3D.
https://arxiv.org/abs/2603.23976
Pathological gait analysis is constrained by limited and variable clinical datasets, which restrict the modeling of diverse gait impairments. To address this challenge, we propose a Pathological Gait-conditioned Generative Adversarial Network (PGcGAN) that synthesises pathology-specific gait sequences directly from observed 3D pose keypoint trajectory data. The framework incorporates one-hot encoded pathology labels within both the generator and discriminator, enabling controlled synthesis across six gait categories. The generator adopts a conditional autoencoder architecture trained with adversarial and reconstruction objectives to preserve structural and temporal gait characteristics. Experiments on the Pathological Gait Dataset demonstrate strong alignment between real and synthetic sequences through PCA and t-SNE analyses, visual kinematic inspection, and downstream classification tasks. Augmenting real data with synthetic sequences improved pathological gait recognition across GRU, LSTM, and CNN models, indicating that pathology-conditioned gait synthesis can effectively support data augmentation in pathological gait analysis.
https://arxiv.org/abs/2603.14409
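The one-hot conditioning mentioned above is commonly realised by appending the label vector to each input frame before it enters the generator or discriminator. The sketch below shows only that concatenation step; the helper names, the integer class indices, and the flattening of 3D keypoints are assumptions, since the abstract does not describe how PGcGAN injects the label.

```python
def one_hot(label, num_classes):
    """Return a one-hot list for an integer class index."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def condition_frame(keypoints, label, num_classes=6):
    """Flatten (x, y, z) keypoints and append the pathology one-hot code."""
    flat = [coord for joint in keypoints for coord in joint]
    return flat + one_hot(label, num_classes)


# Two toy 3D keypoints for a single frame; class index 2 is arbitrary.
frame = [(0.0, 1.0, 0.5), (0.2, 0.8, 0.4)]
conditioned = condition_frame(frame, label=2)
```

Here the six flattened coordinates are followed by six label slots, so changing `label` at synthesis time selects which of the six gait categories the network is asked to produce.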
Gait recognition is an emerging biometric technology that enables non-intrusive and hard-to-spoof human identification. However, most existing methods are confined to short-range, unimodal settings and fail to generalize to long-range and cross-distance scenarios under real-world conditions. To address this gap, we present LRGait, the first LiDAR-Camera multimodal benchmark designed for robust long-range gait recognition across diverse outdoor distances and environments. We further propose EMGaitNet, an end-to-end framework tailored for long-range multimodal gait recognition. To bridge the modality gap between RGB images and point clouds, we introduce a semantic-guided fusion pipeline. A CLIP-based Semantic Mining (SeMi) module first extracts human body-part-aware semantic cues, which are then employed to align 2D and 3D features via a Semantic-Guided Alignment (SGA) module within a unified embedding space. A Symmetric Cross-Attention Fusion (SCAF) module hierarchically integrates visual contours and 3D geometric features, and a Spatio-Temporal (ST) module captures global gait dynamics. Extensive experiments on various gait datasets validate the effectiveness of our method.
https://arxiv.org/abs/2603.14189
Gait recognition is a non-intrusive biometric technique for security applications, yet existing studies are dominated by silhouette- and parsing-based representations. Silhouettes are sparse and miss internal structural details, limiting discriminability. Parsing enriches silhouettes with part-level structures, but relies heavily on upstream human parsers (e.g., label granularity and boundary precision), leading to unstable performance across datasets and sometimes even inferior results to silhouettes. We revisit gait representations from a structural perspective and describe a design space defined by edge density and supervision form: silhouettes use sparse boundary edges with weak single-label supervision, while parsing uses denser cues with strong semantic priors. In this space, we identify an underexplored paradigm: dense part-level structure without explicit semantic labels, and introduce Sketch as a new visual modality for gait recognition. Sketch extracts high-frequency structural cues (e.g., limb articulations and self-occlusion contours) directly from RGB images via edge-based detectors in a label-free manner. We further show that label-guided parsing and label-free sketch are semantically decoupled and structurally complementary. Based on this, we propose SketchGait, a hierarchically disentangled multi-modal framework with two independent streams for modality-specific learning and a lightweight early-stage fusion branch to capture structural complementarity. Extensive experiments on SUSTech1K and CCPG validate the proposed modality and framework: SketchGait achieves 92.9% Rank-1 on SUSTech1K and 93.1% mean Rank-1 on CCPG.
https://arxiv.org/abs/2603.05537
Human identification at a distance (HID) is challenging because traditional biometric modalities such as face and fingerprints are often difficult to acquire in real-world scenarios. Gait recognition provides a practical alternative, as it can be captured reliably at a distance. To promote progress in gait recognition and provide a fair evaluation platform, the International Competition on Human Identification at a Distance (HID) has been organized annually since 2020. Since 2023, the competition has adopted the challenging SUSTech-Competition dataset, which features substantial variations in clothing, carried objects, and view angles. No dedicated training data are provided, requiring participants to train their models using external datasets. Each year, the competition applies a different random seed to generate distinct evaluation splits, which reduces the risk of overfitting and supports a fair assessment of cross-domain generalization. While HID 2023 and HID 2024 already used this dataset, HID 2025 explicitly examined whether algorithmic advances could surpass the accuracy limits observed previously. Despite the heightened difficulty, participants achieved further improvements, and the best-performing method reached 94.2% accuracy, setting a new benchmark on this dataset. We also analyze key technical trends and outline potential directions for future research in gait recognition.
https://arxiv.org/abs/2602.07565
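The seed-dependent evaluation splits described above can be illustrated with a seeded shuffle. Everything specific here is hypothetical: the gallery/probe partition, the `gallery_fraction` value, the sequence identifiers, and the seed are assumptions for the sketch, and the abstract does not describe the competition's actual split procedure.

```python
import random


def make_split(sequence_ids, seed, gallery_fraction=0.5):
    """Deterministically partition sequence IDs into gallery and probe sets."""
    rng = random.Random(seed)          # private generator; global state untouched
    shuffled = sorted(sequence_ids)    # sort first so input order does not matter
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * gallery_fraction)
    return shuffled[:cut], shuffled[cut:]


ids = [f"seq_{i:03d}" for i in range(20)]
gallery_a, probe_a = make_split(ids, seed=2024)
gallery_b, probe_b = make_split(ids, seed=2024)
```

Reusing a seed reproduces the same split exactly, while choosing a new seed each year yields a fresh partition, which is why a model tuned to one year's split gains little on the next.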
In skeleton-based human activity understanding, existing methods often adopt the contrastive learning paradigm to construct a discriminative feature space. However, many of these approaches fail to exploit the structural inter-class similarities and overlook the impact of anomalous positive samples. In this study, we introduce ACLNet, an Affinity Contrastive Learning Network that explores the intricate clustering relationships among human activity classes to improve feature discrimination. Specifically, we propose an affinity metric to refine similarity measurements, thereby forming activity superclasses that provide more informative contrastive signals. A dynamic temperature schedule is also introduced to adaptively adjust the penalty strength for various superclasses. In addition, we employ a margin-based contrastive strategy to improve the separation of hard positive and negative samples within classes. Extensive experiments on NTU RGB+D 60, NTU RGB+D 120, Kinetics-Skeleton, PKU-MMD, FineGYM, and CASIA-B demonstrate the superiority of our method in skeleton-based action recognition, gait recognition, and person re-identification. The source code is available at this https URL.
https://arxiv.org/abs/2601.16694
Gait recognition is emerging as a promising technology and an innovative field within computer vision. However, existing methods typically rely on complex architectures to directly extract features from images and apply pooling operations to obtain sequence-level representations. Such designs often lead to overfitting on static noise (e.g., clothing), while failing to effectively capture dynamic motion information. To address the above challenges, we present a Language-guided and Motion-aware gait recognition framework, named LangMoGR. In particular, we utilize designed gait-related language cues to capture key motion features in gait sequences.
https://arxiv.org/abs/2601.11931
The goal of gait recognition is to extract identity-invariant features of an individual under various gait conditions, e.g., cross-view and cross-clothing. Most gait models strive to implicitly learn the common traits across different gait conditions in a data-driven manner to pull different gait conditions closer for recognition. However, relatively few studies have explicitly explored the inherent relations between different gait conditions. For this purpose, we attempt to establish connections among different gait conditions and propose a new perspective to achieve gait recognition: variations in different gait conditions can be approximately viewed as a combination of geometric transformations. In this case, all we need is to determine the types of geometric transformations and achieve geometric invariance, then identity invariance naturally follows. As an initial attempt, we explore three common geometric transformations (i.e., Reflect, Rotate, and Scale) and design a $\mathcal{R}$eflect-$\mathcal{R}$otate-$\mathcal{S}$cale invariance learning framework, named ${\mathcal{RRS}}$-Gait. Specifically, it first flexibly adjusts the convolution kernel based on the specific geometric transformations to achieve approximate feature equivariance. Then these three equivariant-aware features are respectively fed into a global pooling operation for final invariance-aware learning. Extensive experiments on four popular gait datasets (Gait3D, GREW, CCPG, SUSTech1K) show superior performance across various gait conditions.
https://arxiv.org/abs/2601.05604
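The step from equivariance to invariance described above can be checked on a toy case. The sketch covers only reflection (not rotation or scale) and uses a fixed symmetric row filter in place of the adaptive kernels of the paper; all names are hypothetical. A filter that commutes with horizontal flipping is reflection-equivariant, and applying global pooling afterwards removes the dependence on the flip.

```python
def hflip(img):
    """Mirror a 2D list left to right."""
    return [row[::-1] for row in img]


def smooth_rows(img, kernel=(1, 2, 1)):
    """Zero-padded 1D filtering along each row with a symmetric kernel."""
    half = len(kernel) // 2
    out = []
    for row in img:
        width = len(row)
        new_row = []
        for x in range(width):
            acc = 0
            for k, weight in enumerate(kernel):
                j = x + k - half
                if 0 <= j < width:
                    acc += weight * row[j]
            new_row.append(acc)
        out.append(new_row)
    return out


def global_max_pool(img):
    """Collapse a 2D map to its single largest value."""
    return max(max(row) for row in img)


img = [[1, 0, 0, 3],
       [0, 2, 0, 0]]
features = smooth_rows(img)
features_of_flipped = smooth_rows(hflip(img))
```

Filtering the mirrored input gives the mirrored feature map (equivariance), and both maps share the same pooled value (invariance).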
Human motion understanding has advanced rapidly through vision-based progress in recognition, tracking, and captioning. However, most existing methods overlook physical cues such as joint actuation forces that are fundamental in biomechanics. This gap motivates our study: if and when do physically inferred forces enhance motion understanding? By incorporating forces into established motion understanding pipelines, we systematically evaluate their impact across baseline models on 3 major tasks: gait recognition, action recognition, and fine-grained video captioning. Across 8 benchmarks, incorporating forces yields consistent performance gains; for example, on CASIA-B, Rank-1 gait recognition accuracy improved from 89.52% to 90.39% (+0.87), with larger gains observed under challenging conditions: +2.7% when wearing a coat and +3.0% at the side view. On Gait3D, performance also increases from 46.0% to 47.3% (+1.3). In action recognition, CTR-GCN achieved +2.00% on Penn Action, while high-exertion classes like punching/slapping improved by +6.96%. Even in video captioning, Qwen2.5-VL's ROUGE-L score rose from 0.310 to 0.339 (+0.029), indicating that physics-inferred forces enhance temporal grounding and semantic richness. These results demonstrate that force cues can substantially complement visual and kinematic features under dynamic, occluded, or appearance-varying conditions.
https://arxiv.org/abs/2512.20451
Generative AI (GenAI) models have revolutionized animation, enabling the synthesis of humans and motion patterns with remarkable visual fidelity. However, generating truly realistic human animation remains a formidable challenge, where even minor inconsistencies can make a subject appear unnatural. This limitation is particularly critical when AI-generated videos are evaluated for behavioral biometrics, where subtle motion cues that define identity are easily lost or distorted. The present study investigates whether state-of-the-art GenAI human animation models can preserve the subtle spatio-temporal details needed for person identification through gait biometrics. Specifically, we evaluate four different GenAI models across two primary evaluation tasks to assess their ability to i) restore gait patterns from reference videos under varying conditions of complexity, and ii) transfer these gait patterns to different visual identities. Our results show that while visual quality is mostly high, biometric fidelity remains low in tasks focusing on identification, suggesting that current GenAI models struggle to disentangle identity from motion. Furthermore, through an identity transfer task, we expose a fundamental flaw in appearance-based gait recognition: when texture is disentangled from motion, identification collapses, proving current GenAI models rely on visual attributes rather than temporal dynamics.
https://arxiv.org/abs/2512.19275
Accurate and interpretable gait analysis plays a crucial role in the early detection of Parkinson's disease (PD), yet most existing approaches remain limited by single-modality inputs, low robustness, and a lack of clinical transparency. This paper presents an explainable multimodal framework that integrates RGB and Depth (RGB-D) data to recognize Parkinsonian gait patterns under realistic conditions. The proposed system employs dual YOLOv11-based encoders for modality-specific feature extraction, followed by a Multi-Scale Local-Global Extraction (MLGE) module and a Cross-Spatial Neck Fusion mechanism to enhance spatial-temporal representation. This design captures both fine-grained limb motion (e.g., reduced arm swing) and overall gait dynamics (e.g., short stride or turning difficulty), even in challenging scenarios such as low lighting or occlusion caused by clothing. To ensure interpretability, a frozen Large Language Model (LLM) is incorporated to translate fused visual embeddings and structured metadata into clinically meaningful textual explanations. Experimental evaluations on multimodal gait datasets demonstrate that the proposed RGB-D fusion framework achieves higher recognition accuracy, improved robustness to environmental variations, and clear visual-linguistic reasoning compared with single-input baselines. By combining multimodal feature learning with language-based interpretability, this study bridges the gap between visual recognition and clinical understanding, offering a novel vision-language paradigm for reliable and explainable Parkinson's disease gait analysis. Code: this https URL
https://arxiv.org/abs/2512.04425