Abstract
Visual Question-Answering, a technology that generates textual responses to an image and a natural-language question, has progressed significantly. Notably, it can aid in tracking and inquiring about daily activities, which is crucial in healthcare monitoring, especially for elderly patients and those with memory disabilities. However, video poses privacy concerns and has a limited field of view. This paper presents Sensor2Text, a model proficient in tracking daily activities and engaging in conversations using wearable sensors. The approach tackles several challenges: the low information density of wearable sensor data, the insufficiency of a single wearable sensor for recognizing human activities, and the limited capacity of existing models for question answering and interactive conversation. To overcome these obstacles, transfer learning and student-teacher networks are used to leverage knowledge from visual-language models. An encoder-decoder neural network is then devised to jointly process language and sensor data for conversational purposes, and Large Language Models are employed to enable interactive capabilities. Sensor2Text identifies human activities and engages in Q&A dialogues across various wearable sensor modalities, performing comparably to or better than existing visual-language models on both captioning and conversational tasks. To our knowledge, this is the first model capable of conversing about wearable sensor data, offering an innovative approach to daily activity tracking that addresses the privacy and field-of-view limitations of current vision-based solutions.
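To make the teacher-student transfer concrete, below is a minimal sketch of the general idea: a wearable-sensor encoder (the student) is trained to match the embeddings produced by a frozen visual-language model (the teacher) on time-synchronized video of the same activity, so that the sensor embeddings can later be handed to a text decoder. All module names, dimensions, the placeholder teacher, and the MSE alignment loss are assumptions for illustration only; the abstract does not specify Sensor2Text's actual architecture or training objective.

```python
# Hedged sketch of teacher-student knowledge transfer from a visual-language
# model to a wearable-sensor encoder. Everything here is a stand-in: the
# abstract does not disclose Sensor2Text's real modules or loss functions.

import torch
import torch.nn as nn

class SensorEncoder(nn.Module):
    """Student: maps a window of wearable-sensor readings to an embedding."""
    def __init__(self, n_channels: int = 6, d_model: int = 256, d_embed: int = 512):
        super().__init__()
        self.proj = nn.Linear(n_channels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, d_embed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_channels); mean-pool over time into one embedding
        h = self.encoder(self.proj(x))
        return self.head(h.mean(dim=1))

# Frozen "teacher": a hypothetical placeholder for a pretrained
# visual-language model's image encoder (not the paper's actual teacher).
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
for p in teacher.parameters():
    p.requires_grad = False

student = SensorEncoder()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

# One illustrative distillation step on random stand-in data: video frames
# and sensor windows assumed to come from the same activity clip.
frames = torch.randn(8, 3, 32, 32)   # (batch, C, H, W) video frames
sensors = torch.randn(8, 100, 6)     # (batch, time, channels) IMU readings

target = teacher(frames)              # teacher embedding (no gradient flows)
pred = student(sensors)               # student embedding from sensors alone
loss = nn.functional.mse_loss(pred, target)  # align student to teacher
loss.backward()
optimizer.step()

# The aligned sensor embeddings could then be passed, e.g. as prefix tokens,
# to a text decoder or LLM for captioning and Q&A (not sketched here).
```

At inference time, under this sketch, the video teacher would be discarded entirely: only the sensor encoder and the language decoder remain, which is what sidesteps the privacy and field-of-view limitations of camera-based monitoring.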
URL
https://arxiv.org/abs/2410.20034