Paper Reading AI Learner

What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention

2019-05-22 09:25:25
Antonino Furnari, Giovanni Maria Farinella

Abstract

Egocentric action anticipation consists of understanding which objects the camera wearer will interact with in the near future and which actions they will perform. We tackle the problem by proposing an architecture able to anticipate actions at multiple temporal scales using two LSTMs to 1) summarize the past and 2) formulate predictions about the future. The input video is processed considering three complementary modalities: appearance (RGB), motion (optical flow), and objects (object-based features). Modality-specific predictions are fused using a novel Modality ATTention (MATT) mechanism which learns to weigh modalities in an adaptive fashion. Extensive evaluations on two large-scale benchmark datasets show that our method outperforms prior art by up to +7% on the challenging EPIC-KITCHENS dataset, which includes more than 2500 actions, and generalizes to EGTEA Gaze+. Our approach is also shown to generalize to the tasks of early action recognition and action recognition. At the moment of submission, our method is ranked first on the leaderboard of the EPIC-KITCHENS egocentric action anticipation challenge.
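The fusion step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the hidden size, the single linear layer standing in for the MATT network, and all weights are illustrative assumptions. It only shows the mechanism: the hidden states of the three modality branches are concatenated, mapped to one score per modality, turned into adaptive weights via softmax, and used for a weighted sum of the modality-specific action scores.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions: 3 modalities (RGB, optical flow, objects),
# 2513 action classes (EPIC-KITCHENS), hidden size 128 (an assumption).
n_mod, n_cls, hid = 3, 2513, 128

# Stand-ins for the per-modality LSTM hidden states at one anticipation step.
h = rng.standard_normal((n_mod, hid))

# Stand-ins for the per-modality action scores from each branch's classifier.
scores = rng.standard_normal((n_mod, n_cls))

# MATT (sketched as a single linear layer with random weights): map the
# concatenated hidden states to one score per modality, then softmax to get
# adaptive fusion weights that sum to 1.
W = rng.standard_normal((n_mod * hid, n_mod)) * 0.01
att = softmax(np.concatenate(h) @ W)           # shape (3,)

# Late fusion: attention-weighted sum of the modality-specific predictions.
fused = (att[:, None] * scores).sum(axis=0)    # shape (2513,)
predicted_action = int(fused.argmax())
```

Because the weights are sample-dependent (they are computed from the current hidden states), the model can, for example, lean on object features when motion is uninformative, which is what "adaptive" means here.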

URL

https://arxiv.org/abs/1905.09035

PDF

https://arxiv.org/pdf/1905.09035.pdf

