Abstract
This work addresses the problem of accurate semantic labelling of short videos. We advance the state of the art by proposing a new residual architecture that achieves state-of-the-art classification performance at significantly reduced complexity. Further, we propose four new approaches to diversity-driven multi-net ensembling: one based on a fast correlation measure and three incorporating a DNN-based combiner. We show that significant performance gains can be achieved by "clever" ensembling of diverse nets, and we investigate the factors that contribute to high diversity. Using the extensive YouTube8M dataset, we perform a detailed evaluation of a broad range of deep architectures, including designs based on recurrent networks (RNN), feature space aggregation (FV, VLAD, BoW), simple statistical aggregation, mid-stage AV fusion and others, presenting for the first time an in-depth evaluation and analysis of their behaviour.
URL
https://arxiv.org/abs/1807.01026