Paper Reading AI Learner

Sensor2Text: Enabling Natural Language Interactions for Daily Activity Tracking Using Wearable Sensors

2024-10-26 01:03:13
Wenqiang Chen, Jiaxuan Cheng, Leyao Wang, Wei Zhao, Wojciech Matusik

Abstract

Visual Question-Answering, a technology that generates textual responses from an image and a natural language question, has progressed significantly. Notably, it can aid in tracking and inquiring about daily activities, which is crucial in healthcare monitoring, especially for elderly patients or those with memory disabilities. However, video poses privacy concerns and has a limited field of view. This paper presents Sensor2Text, a model proficient in tracking daily activities and engaging in conversations using wearable sensors. The approach outlined here tackles several challenges, including the low information density of wearable sensor data, the insufficiency of a single wearable sensor for human activity recognition, and the model's limited capacity for Question-Answering and interactive conversations. To resolve these obstacles, transfer learning and student-teacher networks are utilized to leverage knowledge from visual-language models. Additionally, an encoder-decoder neural network model is devised to jointly process language and sensor data for conversational purposes. Furthermore, Large Language Models are utilized to enable interactive capabilities. The model showcases the ability to identify human activities and engage in Q&A dialogues using various wearable sensor modalities. It performs comparably to or better than existing visual-language models in both captioning and conversational tasks. To our knowledge, this represents the first model capable of conversing about wearable sensor data, offering an innovative approach to daily activity tracking that addresses the privacy and field-of-view limitations associated with current vision-based solutions.
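The abstract's teacher-student idea can be sketched as follows: a frozen visual-language "teacher" produces embeddings for video clips, and a lightweight sensor "student" encoder is trained so that its embeddings for time-synchronized sensor windows match the teacher's. This is a minimal illustrative sketch only; the dimensions, the single linear layer, and the plain MSE distillation loss are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 8        # shared embedding size (assumed, not from the paper)
SENSOR_DIM = 6     # e.g. averaged IMU channels per window (assumed)

def teacher_embed(n_clips):
    """Stand-in for a frozen visual-language encoder's clip embeddings."""
    return rng.normal(size=(n_clips, EMB_DIM))

def student_embed(sensor_windows, W):
    """Toy linear sensor encoder: sensor window -> embedding."""
    return sensor_windows @ W

def distill_step(sensor_windows, targets, W, lr=0.05):
    """One gradient step on the MSE distillation loss ||student - teacher||^2."""
    pred = student_embed(sensor_windows, W)
    grad = 2.0 * sensor_windows.T @ (pred - targets) / len(sensor_windows)
    loss = float(np.mean((pred - targets) ** 2))
    return W - lr * grad, loss

# Synthetic "synchronized" data: sensor windows paired with teacher embeddings.
X = rng.normal(size=(32, SENSOR_DIM))
T = teacher_embed(32)
W = np.zeros((SENSOR_DIM, EMB_DIM))

losses = []
for _ in range(200):
    W, loss = distill_step(X, T, W)
    losses.append(loss)

print(f"distillation loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Once distilled, the student's embeddings live in the teacher's space, so a language decoder trained on visual embeddings can, in principle, be reused on sensor data, which is the transfer the abstract describes.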

URL

https://arxiv.org/abs/2410.20034

PDF

https://arxiv.org/pdf/2410.20034.pdf

