Investigation on Combining 3D Convolution of Image Data and Optical Flow to Generate Temporal Action Proposals

2019-03-11 08:55:02
Patrick Schlosser, David Münch, Michael Arens

Abstract

This paper presents a novel two-stream architecture for temporal action proposal generation in long, untrimmed videos. Inspired by recent advances in human action recognition that combine 3D convolutions with two-stream networks, and building on the Single-Stream Temporal Action Proposals (SST) architecture, four different two-stream architectures are investigated, each using sequences of images on one stream and images of optical flow on the other. The four architectures fuse the two separate streams at different depths in the model; for each of them, a broad range of parameters is investigated systematically and an optimal parametrization is determined empirically. Experiments on action and sports datasets show that all four two-stream architectures outperform the original single-stream SST and achieve state-of-the-art results. Additional experiments, in which the previously used optical flow method of Brox is replaced with FlowNet2, still show improvements, indicating that the gains are not restricted to a single method of calculating optical flow.
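
To make the idea of fusing two streams "at different depths" concrete, here is a minimal sketch in plain Python. It is not one of the paper's four architectures: the function names `fuse_features` and `fuse_scores`, the concatenation and weighted-average rules, and the toy numbers are all assumptions chosen for illustration. It contrasts one plausible fusion point (combining per-time-step features from both streams before any shared layers) with another (combining the per-time-step proposal scores each stream produces on its own).

```python
from typing import List

Vector = List[float]


def fuse_features(rgb: List[Vector], flow: List[Vector]) -> List[Vector]:
    """Hypothetical feature-level fusion: concatenate the two streams'
    feature vectors at every time step (assumed rule, not the paper's)."""
    if len(rgb) != len(flow):
        raise ValueError("both streams must cover the same time steps")
    return [r + f for r, f in zip(rgb, flow)]


def fuse_scores(rgb_scores: Vector, flow_scores: Vector,
                rgb_weight: float = 0.5) -> Vector:
    """Hypothetical score-level fusion: weighted average of the proposal
    scores each stream produced independently (assumed rule)."""
    if len(rgb_scores) != len(flow_scores):
        raise ValueError("both streams must cover the same time steps")
    if not 0.0 <= rgb_weight <= 1.0:
        raise ValueError("rgb_weight must lie in [0, 1]")
    return [rgb_weight * r + (1.0 - rgb_weight) * f
            for r, f in zip(rgb_scores, flow_scores)]


# Toy inputs: two time steps, 2-D image features and 1-D flow features.
rgb_features = [[0.1, 0.2], [0.3, 0.4]]
flow_features = [[0.5], [0.6]]
fused = fuse_features(rgb_features, flow_features)

# Toy per-time-step proposal scores from each stream.
averaged = fuse_scores([1.0, 0.0], [0.5, 0.5])
```

In this sketch, feature-level fusion lets any later shared layers see both modalities together, while score-level fusion keeps the streams fully independent until the end; the paper investigates where between such extremes the fusion works best.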

URL

https://arxiv.org/abs/1903.04176

PDF

https://arxiv.org/pdf/1903.04176.pdf

