Abstract
The performance of video saliency estimation techniques has advanced significantly with the rapid development of Convolutional Neural Networks (CNNs). However, devices such as cameras and drones may have limited computational capability and storage space, making the direct deployment of complex deep saliency models infeasible. To address this problem, this paper proposes a dynamic saliency estimation approach for aerial videos via spatiotemporal knowledge distillation. The approach involves five components: two teachers, two students, and the desired spatiotemporal model. The knowledge of spatial and temporal saliency is first transferred separately from the two complex, redundant teachers to their simple, compact students, while the input scenes are downsampled from high to low resolution to remove probable data redundancy and greatly speed up feature extraction. The desired spatiotemporal model is then trained by distilling and encoding the spatial and temporal saliency knowledge of the two students into a unified network. In this manner, inter-model redundancy can be further removed for the effective estimation of dynamic saliency in aerial videos. Experimental results show that the proposed approach outperforms ten state-of-the-art models in estimating visual saliency on aerial videos, while its speed reaches up to 28,738 FPS on a GPU platform.
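The teacher-to-student transfer described above follows the general knowledge-distillation recipe: the student is trained not only against ground-truth labels but also to mimic the teacher's output. As a minimal illustrative sketch (not the paper's actual objective), a student saliency network could be optimized with a weighted blend of a supervised term and a teacher-mimicry term; the function and parameter names here (`distillation_loss`, `alpha`) are assumptions for illustration:

```python
import numpy as np

def distillation_loss(student_map, teacher_map, ground_truth, alpha=0.5):
    """Illustrative distillation objective for a saliency student.

    Blends a supervised MSE term against the ground-truth saliency map
    with a mimicry MSE term against the teacher's predicted map.
    `alpha` trades off the two terms (an assumed hyperparameter).
    """
    supervised = np.mean((student_map - ground_truth) ** 2)
    mimicry = np.mean((student_map - teacher_map) ** 2)
    return alpha * supervised + (1.0 - alpha) * mimicry

# Toy usage: when the student already matches both targets, the loss is zero.
student = np.full((4, 4), 0.5)
loss = distillation_loss(student, student, student)
```

In the paper's setting this kind of objective would be applied twice, once per modality (a spatial teacher-student pair and a temporal one), before the two students' knowledge is fused into the single spatiotemporal network.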
Abstract (translated)
With the rapid development of Convolutional Neural Networks (CNNs), the performance of video saliency estimation techniques has also improved significantly. However, devices such as cameras and drones may have limited computational capability and storage space, so directly deploying complex deep saliency models becomes infeasible. To address this problem, this paper proposes a dynamic saliency estimation approach for aerial videos based on spatiotemporal knowledge distillation. The approach involves five components: two teachers, two students, and the desired spatiotemporal model. The knowledge of spatial and temporal saliency is first transferred separately from the two complex, redundant teachers to the simple, compact students, and the input scenes are degraded from high to low resolution to remove probable data redundancy and thereby greatly accelerate feature extraction. The spatiotemporal saliency knowledge of the two students is then distilled and encoded into a unified network to train the desired spatiotemporal model. In this manner, inter-model redundancy can be further removed, enabling effective estimation of dynamic saliency in aerial videos. Experimental results show that the approach outperforms ten state-of-the-art models in estimating visual saliency on aerial videos, reaching a speed of up to 28,738 FPS on a GPU platform.
URL
https://arxiv.org/abs/1904.04992