Abstract
Overweight and obesity have emerged as widespread societal challenges, frequently linked to unhealthy eating patterns. A promising approach to enhancing dietary monitoring in everyday life is the automated detection of food intake gestures. This study introduces a skeleton-based approach using a model that combines a dilated spatial-temporal graph convolutional network (ST-GCN) with a bidirectional long short-term memory (BiLSTM) network, termed ST-GCN-BiLSTM, to detect intake gestures. The skeleton-based method provides key benefits, including robustness to environmental variation, reduced data dependency, and enhanced privacy preservation. Two datasets were employed for model validation. On the OREBA dataset, which consists of laboratory-recorded videos, the model achieved segmental F1-scores of 86.18% and 74.84% for identifying eating and drinking gestures, respectively. Additionally, a self-collected dataset of smartphone recordings captured under more flexible experimental conditions was evaluated with the model trained on OREBA, yielding F1-scores of 85.40% and 67.80% for detecting eating and drinking gestures, respectively. These results not only confirm the feasibility of using skeleton data for intake gesture detection but also highlight the robustness of the proposed approach in cross-dataset validation.
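The abstract gives no implementation details, but the named architecture (spatial graph convolution over skeleton joints, a dilated temporal convolution over frames, then a BiLSTM producing per-frame gesture predictions) can be sketched in PyTorch. The following is a minimal illustrative sketch, not the authors' code: all layer widths, dilation rates, kernel sizes, the joint count, the identity adjacency matrix, and the three-class output (none/eat/drink) are assumptions chosen for clarity.

# Minimal sketch of a dilated ST-GCN + BiLSTM pipeline for intake-gesture
# detection. NOT the paper's implementation; shapes and hyperparameters are
# illustrative assumptions.
import torch
import torch.nn as nn

class DilatedSTGCNBlock(nn.Module):
    """Spatial graph conv over joints, then a dilated temporal conv over frames."""
    def __init__(self, in_ch, out_ch, A, dilation=1, kernel_t=9):
        super().__init__()
        self.register_buffer("A", A)                  # (V, V) normalized adjacency
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        pad = (kernel_t - 1) // 2 * dilation          # keep frame count T unchanged
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(kernel_t, 1),
                                  padding=(pad, 0), dilation=(dilation, 1))
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU()

    def forward(self, x):                             # x: (N, C, T, V)
        x = torch.einsum("nctv,vw->nctw", x, self.A)  # aggregate neighboring joints
        x = self.spatial(x)                           # mix channels per joint
        x = self.temporal(x)                          # dilated conv along time
        return self.relu(self.bn(x))

class STGCNBiLSTM(nn.Module):
    """ST-GCN feature extractor followed by a BiLSTM with per-frame logits."""
    def __init__(self, A, in_ch=3, hidden=64, n_classes=3):
        super().__init__()
        self.blocks = nn.Sequential(
            DilatedSTGCNBlock(in_ch, 32, A, dilation=1),
            DilatedSTGCNBlock(32, hidden, A, dilation=2),  # dilated, per abstract
        )
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)       # none / eat / drink

    def forward(self, x):                             # x: (N, C, T, V)
        f = self.blocks(x).mean(dim=3)                # pool joints -> (N, C', T)
        f, _ = self.bilstm(f.transpose(1, 2))         # (N, T, 2 * hidden)
        return self.head(f)                           # per-frame class logits

# Usage: 2 clips, 3-D joint coordinates, 150 frames, 18 joints.
V = 18
A = torch.eye(V)                                      # placeholder adjacency matrix
model = STGCNBiLSTM(A)
logits = model(torch.randn(2, 3, 150, V))             # -> (2, 150, 3)

Per-frame logits of this kind would then be decoded into gesture segments (e.g., by thresholding and merging consecutive frames of the same class) before computing the segmental F1-scores reported above.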
URL
https://arxiv.org/abs/2504.10635