Explainable Parkinsons Disease Gait Recognition Using Multimodal RGB-D Fusion and Large Language Models

2025-12-04 03:43:43
Manar Alnaasan, Md Selim Sarowar, Sungho Kim

Abstract

Accurate and interpretable gait analysis plays a crucial role in the early detection of Parkinson's disease (PD), yet most existing approaches remain limited by single-modality inputs, low robustness, and a lack of clinical transparency. This paper presents an explainable multimodal framework that integrates RGB and depth (RGB-D) data to recognize Parkinsonian gait patterns under realistic conditions. The proposed system employs dual YOLOv11-based encoders for modality-specific feature extraction, followed by a Multi-Scale Local-Global Extraction (MLGE) module and a Cross-Spatial Neck Fusion mechanism to enhance spatio-temporal representation. This design captures both fine-grained limb motion (e.g., reduced arm swing) and overall gait dynamics (e.g., short stride or turning difficulty), even in challenging scenarios such as low lighting or occlusion caused by clothing. To ensure interpretability, a frozen Large Language Model (LLM) is incorporated to translate fused visual embeddings and structured metadata into clinically meaningful textual explanations. Experimental evaluations on multimodal gait datasets demonstrate that, compared with single-input baselines, the proposed RGB-D fusion framework achieves higher recognition accuracy, improved robustness to environmental variations, and clear visual-linguistic reasoning. By combining multimodal feature learning with language-based interpretability, this study bridges the gap between visual recognition and clinical understanding, offering a novel vision-language paradigm for reliable and explainable Parkinson's disease gait analysis. Code: this https URL
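
The pipeline described in the abstract (two modality-specific encoders, a fusion step, then a frozen LLM that receives the fused embedding plus structured metadata) can be illustrated with a minimal sketch. The abstract does not specify how MLGE or Cross-Spatial Neck Fusion work, so the code below substitutes a plain weighted average of L2-normalized feature vectors as a stand-in. All names here (`fuse_rgbd`, `build_prompt`, `rgb_weight`, the metadata keys) are hypothetical and are not taken from the authors' code.

```python
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = sum(v * v for v in vec) ** 0.5
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [v / norm for v in vec]


def fuse_rgbd(
    rgb_feat: Sequence[float],
    depth_feat: Sequence[float],
    rgb_weight: float = 0.5,
) -> List[float]:
    """Stand-in fusion: weighted average of the two normalized feature vectors.

    ASSUMPTION: this is NOT the paper's Cross-Spatial Neck Fusion. It only
    shows where a fusion step sits between the two encoders and the LLM.
    """
    if len(rgb_feat) != len(depth_feat):
        raise ValueError("RGB and depth features must have the same length")
    if not 0.0 <= rgb_weight <= 1.0:
        raise ValueError("rgb_weight must lie in [0, 1]")
    rgb = l2_normalize(rgb_feat)
    depth = l2_normalize(depth_feat)
    return [rgb_weight * r + (1.0 - rgb_weight) * d for r, d in zip(rgb, depth)]


def build_prompt(fused: Sequence[float], metadata: Dict[str, str]) -> str:
    """Serialize the fused embedding and metadata into text for a frozen LLM.

    ASSUMPTION: a real system would more likely project the embedding into
    the LLM's token space; plain-text serialization keeps the sketch runnable.
    """
    meta_lines = [f"- {key}: {value}" for key, value in sorted(metadata.items())]
    feat_text = ", ".join(f"{v:.3f}" for v in fused)
    return (
        "Describe the gait pattern in clinical terms.\n"
        "Metadata:\n" + "\n".join(meta_lines) + "\n"
        f"Fused features: [{feat_text}]"
    )


if __name__ == "__main__":
    # Toy vectors standing in for the outputs of the RGB and depth encoders.
    fused = fuse_rgbd([3.0, 4.0], [0.0, 2.0], rgb_weight=0.5)
    print(build_prompt(fused, {"lighting": "low", "view": "frontal"}))
```

Lowering `rgb_weight` shifts the fused vector toward the depth stream, which is one simple way to picture why a depth channel could help when the RGB input is degraded by low lighting.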

URL

https://arxiv.org/abs/2512.04425

PDF

https://arxiv.org/pdf/2512.04425.pdf

