
ECO: Efficient Convolutional Network for Online Video Understanding

2018-05-07 09:46:08
Mohammadreza Zolfaghari, Kamaljeet Singh, Thomas Brox

Abstract

The state of the art in video understanding suffers from two problems: (1) The major part of reasoning is performed locally in the video, so it misses important relationships within actions that span several seconds. (2) While there are local methods with fast per-frame processing, the processing of the whole video is not efficient, which hampers fast video retrieval and online classification of long-term activities. In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time. The architecture is based on merging long-term content already in the network rather than in a post-hoc fusion. Together with a sampling strategy that exploits the fact that neighboring frames are largely redundant, this yields high-quality action classification and video captioning at up to 230 videos per second, where each video can consist of a few hundred frames. The approach achieves competitive performance across all datasets while being 10x to 80x faster than state-of-the-art methods.
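
The abstract mentions a sampling strategy built on the redundancy of neighboring frames but does not spell it out. The sketch below shows one plausible reading, assumed here for illustration only: split the video into equal-length segments and draw a single frame index from each, so that a fixed, small number of frames covers the whole video. The function name sample_frame_indices and its parameters are hypothetical and not taken from the paper.

```python
import random


def sample_frame_indices(num_frames, num_segments, rng=None):
    """Pick one frame index from each of `num_segments` equal-length segments.

    Hypothetical helper: a segment-based sparse sampler, assumed here as one
    way to exploit the redundancy between neighboring frames.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    rng = rng or random.Random()
    indices = []
    for s in range(num_segments):
        # Integer boundaries of segment s over the range [0, num_frames).
        start = (s * num_frames) // num_segments
        end = ((s + 1) * num_frames) // num_segments
        if end <= start:
            # Fewer frames than segments: the segment is empty, reuse a valid frame.
            indices.append(min(start, num_frames - 1))
        else:
            # Draw one frame uniformly at random from inside the segment.
            indices.append(rng.randrange(start, end))
    return indices


if __name__ == "__main__":
    # A video of a few hundred frames reduced to 16 sampled frames.
    print(sample_frame_indices(300, 16, random.Random(0)))
```

Under this assumption the number of frames passed to the network stays constant regardless of video length, which is what would make per-video cost independent of duration.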


URL

https://arxiv.org/abs/1804.09066

PDF

https://arxiv.org/pdf/1804.09066.pdf

