Paper Reading AI Learner

Skeleton-Based Intake Gesture Detection With Spatial-Temporal Graph Convolutional Networks

2025-04-14 18:35:32
Chunzhuo Wang, Zhewen Xue, T. Sunil Kumar, Guido Camps, Hans Hallez, Bart Vanrumste

Abstract

Overweight and obesity have emerged as widespread societal challenges, frequently linked to unhealthy eating patterns. A promising approach to enhance dietary monitoring in everyday life is the automated detection of food intake gestures. This study introduces a skeleton-based approach using a model that combines a dilated spatial-temporal graph convolutional network (ST-GCN) with a bidirectional long short-term memory (BiLSTM) network, referred to as ST-GCN-BiLSTM, to detect intake gestures. The skeleton-based method provides key benefits, including robustness to environmental variation, reduced data dependency, and enhanced privacy preservation. Two datasets were employed for model validation. On the OREBA dataset, which consists of laboratory-recorded videos, the model achieved segmental F1-scores of 86.18% and 74.84% for identifying eating and drinking gestures. Additionally, a self-collected dataset recorded with smartphones under more flexible experimental conditions was evaluated with the model trained on OREBA, yielding F1-scores of 85.40% and 67.80% for detecting eating and drinking gestures, respectively. These results not only confirm the feasibility of using skeleton data for intake gesture detection but also highlight the robustness of the proposed approach in cross-dataset validation.
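
The abstract describes stacking a dilated ST-GCN over skeleton joints with a BiLSTM for temporal labeling. The sketch below is an illustrative reconstruction based only on that description; the joint graph, layer sizes, dilation rates, and per-frame output scheme are assumptions, not the authors' configuration.

```python
# Minimal sketch of an ST-GCN + BiLSTM classifier for skeleton-based intake
# gesture detection. Illustrative only: the adjacency matrix, channel widths,
# and dilation rates are placeholder assumptions.
import torch
import torch.nn as nn


class STGCNBlock(nn.Module):
    """One spatial-temporal block: graph convolution over joints,
    followed by a dilated temporal convolution along the frame axis."""

    def __init__(self, in_channels, out_channels, adjacency, dilation=1):
        super().__init__()
        # Normalized adjacency matrix of the skeleton graph (J x J).
        self.register_buffer("A", adjacency)
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.temporal = nn.Conv2d(
            out_channels, out_channels,
            kernel_size=(9, 1), padding=(4 * dilation, 0), dilation=(dilation, 1),
        )
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, time, joints)
        x = torch.einsum("nctv,vw->nctw", x, self.A)  # aggregate graph neighbors
        return self.relu(self.bn(self.temporal(self.spatial(x))))


class STGCNBiLSTM(nn.Module):
    """Stacked ST-GCN blocks, pooling over joints, then a BiLSTM producing
    a label per frame (e.g., background / eating / drinking)."""

    def __init__(self, num_joints, in_channels=2, num_classes=3, hidden=64):
        super().__init__()
        # Placeholder graph with self-loops only; a real skeleton adjacency
        # (e.g., wrist-elbow-shoulder links) would replace this.
        adjacency = torch.eye(num_joints)
        self.gcn = nn.Sequential(
            STGCNBlock(in_channels, 32, adjacency, dilation=1),
            STGCNBlock(32, 64, adjacency, dilation=2),
        )
        self.bilstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):
        # x: (batch, channels, time, joints), e.g., 2-D joint coordinates
        x = self.gcn(x)
        x = x.mean(dim=3)          # pool over joints -> (batch, C, T)
        x = x.permute(0, 2, 1)     # -> (batch, T, C) for the LSTM
        x, _ = self.bilstm(x)
        return self.head(x)        # per-frame class logits


if __name__ == "__main__":
    model = STGCNBiLSTM(num_joints=13)
    frames = torch.randn(1, 2, 120, 13)  # 120 frames, 13 joints, (x, y)
    print(model(frames).shape)           # torch.Size([1, 120, 3])
```

Per-frame logits of this kind would typically be thresholded and merged into gesture segments before computing a segmental F1-score, as reported in the abstract.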

URL

https://arxiv.org/abs/2504.10635

PDF

https://arxiv.org/pdf/2504.10635.pdf

