Abstract
Egocentric action anticipation consists in understanding which objects the camera wearer will interact with in the near future and which actions they will perform. We tackle the problem by proposing an architecture able to anticipate actions at multiple temporal scales using two LSTMs to 1) summarize the past and 2) formulate predictions about the future. The input video is processed considering three complementary modalities: appearance (RGB), motion (optical flow), and objects (object-based features). Modality-specific predictions are fused using a novel Modality ATTention (MATT) mechanism which learns to weigh modalities in an adaptive fashion. Extensive evaluations on two large-scale benchmark datasets show that our method outperforms prior art by up to +7% on the challenging EPIC-KITCHENS dataset, which includes more than 2500 actions, and generalizes to EGTEA Gaze+. Our approach is also shown to generalize to the tasks of early action recognition and action recognition. At the time of submission, our method is ranked first on the leaderboard of the EPIC-KITCHENS egocentric action anticipation challenge.
URL
https://arxiv.org/abs/1905.09035