Paper Reading AI Learner

A Real-Time Human Action Recognition Model for Assisted Living

2025-03-18 20:22:17
Yixuan Wang, Paul Stynes, Pramod Pathak, Cristina Muntean

Abstract

Ensuring the safety and well-being of elderly and vulnerable populations in assisted living environments is a critical concern. Computer vision offers an innovative and powerful approach to predicting health risks through video monitoring, employing human action recognition (HAR) technology. However, real-time prediction of human actions with high performance and efficiency remains a challenge. This research proposes a real-time human action recognition model that combines a deep learning model with a live video prediction and alert system to predict falls, staggering, and chest pain for residents in assisted living. Six thousand RGB video samples from the NTU RGB+D 60 dataset were selected to create a dataset with four classes: Falling, Staggering, Chest Pain, and Normal, with the Normal class comprising 40 daily activities. Transfer learning was applied to train four state-of-the-art HAR models on a GPU server: UniFormerV2, TimeSformer, I3D, and SlowFast. Results for the four models are reported in terms of class-wise and macro performance metrics, inference efficiency, model complexity, and computational cost. TimeSformer is proposed for the real-time human action recognition model on the strength of its leading macro F1 score (95.33%), recall (95.49%), and precision (95.19%), along with significantly higher inference throughput than the other models. This research provides insights into enhancing the safety and health of the elderly and people with chronic illnesses in assisted living environments, fostering sustainable care, smarter communities, and industry innovation.
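
As an illustration of the training and evaluation setup described in the abstract, the Python sketch below fine-tunes a pretrained TimeSformer checkpoint for the four target classes via transfer learning and computes the class-wise and macro precision, recall, and F1 metrics. This is a minimal sketch, not the authors' pipeline: the Hugging Face checkpoint name, the 8-frame 224x224 clip shape, and the label ordering are illustrative assumptions, and the paper's actual training framework and hyperparameters are not specified here.

# Hypothetical sketch: adapt a pretrained TimeSformer to the four target
# classes (transfer learning) and report class-wise and macro metrics.
import torch
from transformers import TimesformerForVideoClassification
from sklearn.metrics import classification_report, precision_recall_fscore_support

LABELS = ["Falling", "Staggering", "Chest Pain", "Normal"]  # assumed ordering

# Transfer learning: start from a Kinetics-400 checkpoint and re-initialise
# the classification head for 4 classes instead of 400.
model = TimesformerForVideoClassification.from_pretrained(
    "facebook/timesformer-base-finetuned-k400",   # assumed base checkpoint
    num_labels=len(LABELS),
    ignore_mismatched_sizes=True,                 # allow the new, smaller head
)

def predict(clips: torch.Tensor) -> torch.Tensor:
    """clips: (batch, num_frames, 3, 224, 224) RGB tensor with values in [0, 1]."""
    model.eval()
    with torch.no_grad():
        logits = model(pixel_values=clips).logits
    return logits.argmax(dim=-1)

def evaluate(y_true, y_pred):
    """Class-wise report plus the macro precision/recall/F1 used in the abstract."""
    print(classification_report(y_true, y_pred, target_names=LABELS, digits=4))
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    print(f"macro precision={p:.4f} recall={r:.4f} f1={f1:.4f}")

In a full pipeline, predict() would be called on sliding windows of the live video stream, and an alert would be raised whenever the predicted class is Falling, Staggering, or Chest Pain.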

URL

https://arxiv.org/abs/2503.18957

PDF

https://arxiv.org/pdf/2503.18957.pdf

