SilLang: Improving Gait Recognition with Silhouette Language Encoding

2026-03-25 06:15:29
Ruiyi Zhan, Guozhen Peng, Canyu Chen, Jian Lei, Annan Li

Abstract

Gait silhouettes, which can be encoded into binary gait codes, are widely adopted to represent the motion patterns of pedestrians. Recent approaches commonly leverage visual backbones to encode gait silhouettes and achieve strong performance. However, they focus primarily on continuous visual features, overlooking the discrete nature of binary silhouettes, which inherently share a discrete encoding space with natural language. Large Language Models (LLMs) have demonstrated exceptional capability in extracting discriminative features from discrete sequences and modeling long-range dependencies, which suggests they could capture temporal motion patterns by identifying subtle variations. Motivated by these observations, we explore bridging binary gait silhouettes and natural language within a binary encoding space. However, the encoding spaces of text tokens and binary gait silhouettes remain misaligned, primarily due to differences in token frequency and density. To address this issue, we propose the Contour-Velocity Tokenizer, which encodes binary gait silhouettes while reshaping their distribution to better align with the text token space. We then establish a dual-branch framework, termed Silhouette Language Model (SilLang), which enhances visual silhouette features by integrating discrete linguistic embeddings derived from LLMs. Implemented on mainstream gait backbones, SilLang consistently improves state-of-the-art methods on SUSTech1K, GREW, and Gait3D.

URL

https://arxiv.org/abs/2603.23976

PDF

https://arxiv.org/pdf/2603.23976.pdf

