Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video Recognition

Abstract

3D convolutional neural networks (CNNs) have been the prevailing option for video recognition. To capture temporal information, 3D convolutions are computed along the frame sequence, leading to cubically growing and expensive computation. To reduce this cost, previous methods resort either to manually designed 3D/2D CNN structures with approximations or to automatic search, which sacrifice modeling ability or make training time-consuming. In this work, we propose to automatically design efficient 3D CNN architectures via a novel training-free neural architecture search approach that is tailored for 3D CNNs and takes model complexity into account. To measure the expressiveness of 3D CNNs efficiently, we formulate a 3D CNN as an information system and derive an analytic entropy score based on the Maximum Entropy Principle. Specifically, we propose a spatio-temporal entropy score (STEntr-Score) with a refinement factor that handles the discrepancy of visual information between the spatial and temporal dimensions by dynamically leveraging the depth-wise correlation between feature map size and kernel size. Highly efficient and expressive 3D CNN architectures, i.e., entropy-based 3D CNNs (the E3D family), can then be searched by maximizing the STEntr-Score under a given computational budget with an evolutionary algorithm, without training the network parameters. Extensive experiments on Something-Something V1 & V2 and Kinetics400 demonstrate that the E3D family achieves state-of-the-art performance with higher computational efficiency. Code is available at this https URL.
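The search loop described in the abstract (maximize a training-free score under a computational budget with an evolutionary algorithm) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the layer encoding, the `proxy_entropy` function (a plain sum of log fan-in terms, not the paper's STEntr-Score and without its refinement factor), the `cost` estimate, and the mutation choices are all hypothetical and not taken from the authors' code.

```python
import math
import random

# Hypothetical encoding: an architecture is a list of layers,
# each layer a tuple (out_channels, temporal_kernel, spatial_kernel).

def proxy_entropy(arch, in_channels=3):
    """Placeholder expressiveness score: sum of log(c_in * k_t * k_s^2).

    This is an assumed stand-in, NOT the paper's STEntr-Score.
    """
    score, c_in = 0.0, in_channels
    for c_out, k_t, k_s in arch:
        score += math.log(c_in * k_t * k_s * k_s)
        c_in = c_out
    return score

def cost(arch, in_channels=3, positions=8 * 56 * 56):
    """Rough multiply-accumulate count, assuming a fixed T*H*W output size."""
    total, c_in = 0, in_channels
    for c_out, k_t, k_s in arch:
        total += c_in * c_out * k_t * k_s * k_s * positions
        c_in = c_out
    return total

def mutate(arch, rng):
    """Replace one randomly chosen layer with random width and kernel sizes."""
    child = list(arch)
    i = rng.randrange(len(child))
    child[i] = (rng.choice([16, 32, 64, 128]),
                rng.choice([1, 3, 5]),
                rng.choice([1, 3, 5]))
    return child

def evolve(budget, depth=4, population=16, steps=200, seed=0):
    """Training-free search: keep the highest-scoring architectures within budget."""
    rng = random.Random(seed)
    start = [(16, 1, 1)] * depth  # smallest architecture in this toy space
    if cost(start) > budget:
        raise ValueError("budget is too small for the starting architecture")
    pop = [start]
    for _ in range(steps):
        child = mutate(rng.choice(pop), rng)
        if cost(child) <= budget:  # reject candidates that exceed the budget
            pop.append(child)
        # Keep only the top-scoring candidates; no network is ever trained.
        pop = sorted(pop, key=proxy_entropy, reverse=True)[:population]
    return pop[0]

best = evolve(budget=2_000_000_000)
print(best, proxy_entropy(best), cost(best))
```

The point of the sketch is only the control flow: candidates are scored analytically from their shape, filtered by the budget, and ranked, so the search cost does not include any training.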

Abstract (translated)

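3D convolutional neural networks (CNNs) have become the main choice for video recognition. To capture temporal information, 3D convolutions are computed along the sequence, which leads to cubically growing and expensive computation. To reduce the computational cost, previous methods rely on manually designed 3D/2D CNN structures with approximations or on automatic search, which sacrifices modeling ability or makes training time-consuming. In this paper, we propose to design efficient and expressive 3D CNN architectures through a training-free neural architecture search method that is tailored for 3D CNNs and takes model complexity into account. To measure the expressiveness of a 3D CNN efficiently, we formulate it as an information system and derive an analytic entropy score based on the Maximum Entropy Principle. Specifically, we propose a spatio-temporal entropy score (STEntr-Score) with a refinement factor that handles the discrepancy of visual information between the spatial and temporal dimensions by dynamically leveraging the depth-wise correlation between feature map size and kernel size. Extensive experiments on Something-Something V1 & V2 and Kinetics400 demonstrate that the E3D family achieves state-of-the-art performance with higher computational efficiency. Code is available at this https URL.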

URL

https://arxiv.org/abs/2303.02693

PDF

https://arxiv.org/pdf/2303.02693.pdf