Action Search: Spotting Actions in Videos and Its Application to Temporal Action Localization

2018-07-27 16:27:25
Humam Alwassel, Fabian Caba Heilbron, Bernard Ghanem

Abstract

State-of-the-art temporal action detectors inefficiently search the entire video for specific actions. Despite the encouraging progress these methods achieve, it is crucial to design automated approaches that only explore parts of the video which are the most relevant to the actions being searched for. To address this need, we propose the new problem of action spotting in video, which we define as finding a specific action in a video while observing a small portion of that video. Inspired by the observation that humans are extremely efficient and accurate in spotting and finding action instances in video, we propose Action Search, a novel Recurrent Neural Network approach that mimics the way humans spot actions. Moreover, to address the absence of data recording the behavior of human annotators, we put forward the Human Searches dataset, which compiles the search sequences employed by human annotators spotting actions in the AVA and THUMOS14 datasets. We consider temporal action localization as an application of the action spotting problem. Experiments on the THUMOS14 dataset reveal that our model is not only able to explore the video efficiently (observing on average 17.3% of the video) but it also accurately finds human activities with 30.8% mAP.
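
For readers who want a more concrete picture of the search mechanism the abstract describes, below is a minimal sketch of how a recurrent search policy could be rolled out over a video: at each step it observes features at its current temporal position and predicts the next position to inspect, so only a small fraction of frames is ever touched. The `SearchPolicy` module, its feature dimensions, the fixed observation budget, and the stand-in `get_feature` extractor are all illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: a recurrent policy that hops through a video,
# observing one temporal position per step and predicting where to look next.
# Dimensions, the stopping rule, and the feature extractor are assumptions.
import torch
import torch.nn as nn


class SearchPolicy(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        # Input = frame feature concatenated with the current normalized position.
        self.cell = nn.LSTMCell(feat_dim + 1, hidden_dim)
        # Head that predicts the next position to observe, in [0, 1].
        self.next_pos = nn.Linear(hidden_dim, 1)

    def forward(self, get_feature, num_steps=8):
        """Roll out a search over a video.

        get_feature: callable mapping a (1, 1) normalized-position tensor
                     to a (1, feat_dim) frame-feature tensor (assumed to exist).
        """
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros_like(h)
        pos = torch.tensor([[0.5]])          # start searching at the video midpoint
        visited = []
        for _ in range(num_steps):           # fixed observation budget for this sketch
            feat = get_feature(pos)
            h, c = self.cell(torch.cat([feat, pos], dim=1), (h, c))
            pos = torch.sigmoid(self.next_pos(h))   # next temporal location to observe
            visited.append(pos.item())
        return visited                        # trajectory of observed positions


if __name__ == "__main__":
    policy = SearchPolicy()
    # Stand-in feature extractor: random vectors in place of real frame features.
    fake_features = lambda p: torch.randn(1, 512)
    print(policy(fake_features))
```

In the actual system, such a policy would be trained so that its trajectories converge on action instances quickly, which is what allows localization while observing only a small portion of the video.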

URL

https://arxiv.org/abs/1706.04269

PDF

https://arxiv.org/pdf/1706.04269.pdf

