Abstract
Humans use their gaze to focus on essential information while perceiving videos and interpreting the intentions of the agents in them. Incorporating human gaze into computational algorithms can significantly improve model performance on video understanding tasks. In this work, we address a challenging and novel video understanding task: predicting the actions of an agent given only a partial video. We introduce the Gaze-guided Action Anticipation algorithm, which builds a visual-semantic graph from the video input. Our method uses a Graph Neural Network to recognize the agent's intention and to predict the action sequence that fulfills it. To assess the effectiveness of our approach, we collect a dataset of household activities generated in the VirtualHome environment, accompanied by human gaze data recorded while viewing the videos. Our method outperforms state-of-the-art techniques, achieving a 7% improvement in accuracy on 18-class intention recognition. This highlights the effectiveness of our method in learning important features from human gaze data.
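The pipeline described above (a visual-semantic graph over detected entities, processed by a GNN to score candidate intentions) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the node names, features, edge list, weights, and the mean-aggregation update are placeholders, not the authors' actual model, which would learn these components from gaze-weighted video features.

```python
# Hypothetical sketch: one message-passing round over a toy visual-semantic
# graph, followed by mean-pooling and per-class scoring of intentions.
# All features and weights below are illustrative assumptions.

def message_pass(features, edges):
    """One round of mean-aggregation message passing.

    features: dict mapping node name -> feature vector (list of floats)
    edges: list of undirected (u, v) pairs
    """
    neighbors = {n: [] for n in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    updated = {}
    for n, feat in features.items():
        # Collect neighbor features; isolated nodes fall back to themselves.
        msgs = [features[m] for m in neighbors[n]] or [feat]
        agg = [sum(vals) / len(msgs) for vals in zip(*msgs)]
        # Residual-style update: average self feature with the aggregate.
        updated[n] = [(a + b) / 2 for a, b in zip(feat, agg)]
    return updated

def score_intentions(features, class_weights):
    """Mean-pool node features, then dot with each class weight vector."""
    pooled = [sum(vals) / len(features) for vals in zip(*features.values())]
    return {c: sum(w * x for w, x in zip(ws, pooled))
            for c, ws in class_weights.items()}

# Toy graph: nodes are detected objects/actions; in the real method their
# features would come from gaze-weighted visual embeddings.
feats = {"cup": [1.0, 0.0], "faucet": [0.5, 0.5], "walk": [0.0, 1.0]}
edges = [("cup", "faucet"), ("walk", "cup")]
h = message_pass(feats, edges)
scores = score_intentions(h, {"get_water": [1.0, 0.2], "watch_tv": [0.1, 0.1]})
best = max(scores, key=scores.get)  # -> "get_water"
```

In practice the paper's model stacks learned GNN layers and classifies over 18 intention classes, but the structure is the same: aggregate evidence across graph neighbors, pool, and score intentions.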
Abstract (translated)
Humans use their gaze to focus on key information in videos while observing and interpreting the intentions shown in them. Incorporating human gaze into computational algorithms can significantly improve model performance on video understanding tasks. In this work, we address a challenging and novel video understanding task: predicting an agent's actions from a partial video. We introduce the gaze-guided action anticipation algorithm, which builds a visual-semantic graph from the video input. Our method uses a graph neural network to recognize the agent's intention and predict the action sequence that fulfills it. To evaluate our method, we collect a dataset of household activities generated in the VirtualHome environment, accompanied by human gaze data from viewing the videos. Our method surpasses state-of-the-art techniques, improving accuracy on 18-class intention recognition by 7%. This highlights the effectiveness of our method in learning important features from human gaze data.
URL
https://arxiv.org/abs/2404.07347