Paper Reading AI Learner

Fast Semantic Segmentation on Video Using Motion Vector-Based Feature Interpolation

2018-07-06 23:58:55
Samvit Jain, Joseph E. Gonzalez

Abstract

Models optimized for accuracy on challenging, dense prediction tasks such as semantic segmentation entail significant inference costs, and are prohibitively slow to run on each frame in a video. Since nearby video frames are spatially similar, however, there is substantial opportunity to reuse computation. Existing work has explored basic feature reuse and feature warping based on optical flow, but has encountered limits to the speedup attainable with these techniques. In this paper, we present a new, two-part approach to accelerating inference on video. First, we propose a fast feature propagation scheme that utilizes the block motion vector maps present in compressed video to cheaply propagate features from frame to frame. Second, we develop a novel feature estimation scheme, termed feature interpolation, that fuses features propagated from enclosing keyframes to render accurate feature estimates, even at sparse keyframe frequencies. We evaluate our system on the Cityscapes and CamVid datasets, comparing to both a frame-by-frame baseline and related work. We find that we are able to substantially accelerate segmentation on video, achieving twice the average inference speed of prior work at any target accuracy level.
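
The two components described above lend themselves to a compact sketch. Below is a minimal NumPy illustration of the idea, assuming per-block (dx, dy) motion vectors have already been extracted from the compressed bitstream and rescaled to feature-map resolution. The function names, the nearest-neighbor block warping, and the linear distance-based fusion weight are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def warp_features(features, motion_vectors, block_size=16):
    """Propagate a (C, H, W) feature map using a block motion vector field.

    motion_vectors has shape (H // block_size, W // block_size, 2) and holds
    the (dx, dy) displacement of each block relative to the reference frame,
    assumed here to be rescaled to feature-map coordinates.
    """
    C, H, W = features.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Each feature-map location inherits the motion vector of its block.
    dx = motion_vectors[ys // block_size, xs // block_size, 0]
    dy = motion_vectors[ys // block_size, xs // block_size, 1]
    # Nearest-neighbor sampling of the reference features at the displaced
    # (and clipped) source locations.
    src_y = np.clip(ys - dy, 0, H - 1).astype(int)
    src_x = np.clip(xs - dx, 0, W - 1).astype(int)
    return features[:, src_y, src_x]


def interpolate_features(prev_key_feat, next_key_feat,
                         mv_from_prev, mv_from_next,
                         offset, keyframe_interval, block_size=16):
    """Fuse features propagated forward from the previous keyframe and
    backward from the next keyframe into an estimate for the current frame.

    offset is the current frame's distance (in frames) from the previous
    keyframe; the linear weighting below is an illustrative choice.
    """
    forward = warp_features(prev_key_feat, mv_from_prev, block_size)
    backward = warp_features(next_key_feat, mv_from_next, block_size)
    alpha = 1.0 - offset / keyframe_interval  # nearer keyframe weighs more
    return alpha * forward + (1.0 - alpha) * backward


if __name__ == "__main__":
    # Toy shapes: 128-channel features on a 64x64 grid, 16x16 blocks.
    rng = np.random.default_rng(0)
    feats_prev = rng.standard_normal((128, 64, 64))
    feats_next = rng.standard_normal((128, 64, 64))
    mv_prev = rng.integers(-3, 4, size=(4, 4, 2))
    mv_next = rng.integers(-3, 4, size=(4, 4, 2))
    estimate = interpolate_features(feats_prev, feats_next, mv_prev, mv_next,
                                    offset=3, keyframe_interval=10)
    print(estimate.shape)  # (128, 64, 64)
```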

URL

https://arxiv.org/abs/1803.07742

PDF

https://arxiv.org/pdf/1803.07742.pdf

