Paper Reading AI Learner

Spatiotemporal Knowledge Distillation for Efficient Estimation of Aerial Video Saliency

2019-04-10 03:41:10
Jia Li, Kui Fu, Shengwei Zhao, Shiming Ge

Abstract

The performance of video saliency estimation techniques has advanced significantly with the rapid development of Convolutional Neural Networks (CNNs). However, devices such as cameras and drones often have limited computational capability and storage space, making the direct deployment of complex deep saliency models infeasible. To address this problem, this paper proposes a dynamic saliency estimation approach for aerial videos via spatiotemporal knowledge distillation. The approach involves five components: two teachers, two students, and the desired spatiotemporal model. The knowledge of spatial and temporal saliency is first transferred separately from the two complex and redundant teachers to their simple and compact students, while the input scenes are degraded from high resolution to low resolution to remove likely data redundancy and greatly speed up feature extraction. The desired spatiotemporal model is then trained by distilling and encoding the spatial and temporal saliency knowledge of the two students into a unified network. In this manner, inter-model redundancy can be further removed for effective estimation of dynamic saliency on aerial videos. Experimental results show that the proposed approach outperforms ten state-of-the-art models in estimating visual saliency on aerial videos, while its speed reaches up to 28,738 FPS on a GPU platform.
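
The abstract only sketches the distillation setup, so the following is a minimal illustrative example of the core idea: a compact student, fed a downscaled frame, is trained to reproduce the saliency map of a larger teacher that sees the full-resolution frame. The network definitions, the binary cross-entropy mimicry loss, the 112x112 student resolution, and the names TinySaliencyNet / distill_step are assumptions for this sketch, not the authors' actual architectures or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySaliencyNet(nn.Module):
    """Placeholder fully convolutional saliency predictor (outputs a map in [0, 1])."""
    def __init__(self, width=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 1, 1),
        )

    def forward(self, x):
        return torch.sigmoid(self.body(x))

def distill_step(teacher, student, frames, optimizer, low_res=(112, 112)):
    """One distillation step: the student sees a degraded (low-resolution) frame
    and is trained to match the teacher's saliency map."""
    teacher.eval()
    with torch.no_grad():
        target = teacher(frames)                      # teacher saliency map (full resolution)
    small = F.interpolate(frames, size=low_res,       # degraded input for the compact student
                          mode='bilinear', align_corners=False)
    pred = student(small)
    target = F.interpolate(target, size=pred.shape[-2:],
                           mode='bilinear', align_corners=False)
    loss = F.binary_cross_entropy(pred, target)       # mimic the teacher's map
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: in the paper's setting there would be a spatial and a temporal
# teacher/student pair, plus a final spatiotemporal model distilled from both students.
teacher = TinySaliencyNet(width=64)   # stand-in for a large pretrained teacher
student = TinySaliencyNet(width=16)   # compact student
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
frames = torch.rand(2, 3, 448, 448)   # mini-batch of high-resolution frames
print(distill_step(teacher, student, frames, optimizer))
```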

URL

https://arxiv.org/abs/1904.04992

PDF

https://arxiv.org/pdf/1904.04992.pdf

