Abstract
How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its "width" (namely, the number of channels in convolutional layers and the number of nodes in fully-connected internal layers) is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK), which captures the behavior of fully-connected deep nets trained by gradient descent in the infinite-width limit; this object was implicit in some other recent papers. A subsequent paper [Lee et al., 2019] gave heuristic Monte Carlo methods to estimate the NTK and its extension, the Convolutional Neural Tangent Kernel (CNTK), and used these to try to understand the limiting behavior on datasets like CIFAR-10. The current paper gives the first efficient exact algorithm (based upon dynamic programming) for computing the CNTK, as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for the performance of a pure kernel-based method on CIFAR-10: 10% higher than the methods reported in [Novak et al., 2019], and only 5% lower than the performance of the corresponding finite deep net architecture (once batch normalization, etc. are turned off). We give the first non-asymptotic proof that a fully-trained, sufficiently wide net is indeed equivalent to the kernel regression predictor using the NTK. Our experiments also demonstrate that the earlier Monte Carlo approximation can degrade performance significantly, highlighting the power of our exact kernel computation, which we have applied even to the full CIFAR-10 dataset and 20-layer nets.
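To make the abstract's central objects concrete, below is a minimal sketch of the layer-by-layer dynamic-programming recursion that yields the exact NTK of a fully-connected ReLU network; the CNTK the paper computes is the convolutional analogue of this recursion. This is not the paper's released code: the function name relu_ntk and the NumPy implementation are illustrative assumptions, though the closed-form Gaussian (arc-cosine) expectations used for ReLU are the standard ones.

```python
import numpy as np

def relu_ntk(X, depth):
    """Exact NTK Gram matrix for a fully-connected ReLU net of the given
    depth, via the standard layer-by-layer (dynamic-programming) recursion.
    X: (n, d) array of nonzero inputs. Returns an (n, n) kernel matrix."""
    sigma = X @ X.T              # Sigma^(0): inner products of the inputs
    ntk = sigma.copy()           # Theta^(0) = Sigma^(0)
    for _ in range(depth):
        diag = np.sqrt(np.diag(sigma))           # per-example feature norms
        norm = np.outer(diag, diag)
        cos = np.clip(sigma / norm, -1.0, 1.0)   # correlations in [-1, 1]
        theta = np.arccos(cos)
        # Closed-form expectations for ReLU (normalization c_sigma = 2):
        sigma = norm * (np.sin(theta) + (np.pi - theta) * cos) / np.pi
        sigma_dot = (np.pi - theta) / np.pi      # derivative kernel
        ntk = ntk * sigma_dot + sigma            # NTK recursion
    return ntk
```

The equivalence theorem the abstract mentions then says that a fully-trained, sufficiently wide net agrees with the kernel regression predictor f(x) = K(x, X_train) K(X_train, X_train)^{-1} y_train built from this kernel; in NumPy, predictions would be K_test @ np.linalg.solve(K_train, y_train) for Gram blocks assembled from a relu_ntk-style computation.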
URL
https://arxiv.org/abs/1904.11955