
Deep Architectures and Ensembles for Semantic Video Classification

2018-07-03 08:49:47
Eng-Jon Ong, Sameed Husain, Mikel Bober, Miroslaw Bober

Abstract

This work addresses the problem of accurate semantic labelling of short videos. We advance the state of the art by proposing a new residual architecture, with state-of-the-art classification performance at significantly reduced complexity. Further, we propose four new approaches to diversity-driven multi-net ensembling, one based on a fast correlation measure and three incorporating a DNN-based combiner. We show that significant performance gains can be achieved by "clever" ensembling of diverse nets and we investigate factors contributing to high diversity. Based on the extensive YouTube8M dataset, we perform a detailed evaluation of a broad range of deep architectures, including designs based on recurrent networks (RNN), feature space aggregation (FV, VLAD, BoW), simple statistical aggregation, mid-stage AV fusion and others, presenting for the first time an in-depth evaluation and analysis of their behaviour.
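
The abstract does not say which correlation measure or selection procedure the authors use, so the following is only an illustrative sketch of the general idea of correlation-driven ensembling, not the paper's method. It assumes each network's scores are available as a NumPy array, and the function names `select_diverse` and `average_ensemble` are hypothetical. Networks are picked greedily so that each new member has the lowest mean Pearson correlation with those already chosen, and the selected scores are then averaged.

```python
import numpy as np


def select_diverse(preds, k, seed_index=0):
    """Greedily pick k models whose score vectors are least correlated.

    preds: array of shape (n_models, n_samples, n_classes).
    seed_index: model to start from (assumption: the best single model).
    Returns a list of selected model indices.
    """
    n_models = preds.shape[0]
    flat = preds.reshape(n_models, -1)
    corr = np.corrcoef(flat)  # pairwise Pearson correlation between models
    selected = [seed_index]
    while len(selected) < min(k, n_models):
        remaining = [i for i in range(n_models) if i not in selected]
        # Mean correlation of each candidate with the current ensemble.
        scores = [corr[i, selected].mean() for i in remaining]
        selected.append(remaining[int(np.argmin(scores))])
    return selected


def average_ensemble(preds, indices):
    """Average the class scores of the selected models."""
    return preds[indices].mean(axis=0)
```

A DNN-based combiner, as mentioned in the abstract, would replace the plain average with a learned function of the member scores; that variant is not sketched here.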

URL

https://arxiv.org/abs/1807.01026

PDF

https://arxiv.org/pdf/1807.01026.pdf

