Abstract
There currently exist two extreme viewpoints on neural network feature learning: (i) neural networks simply implement a kernel method (à la NTK), and hence no features are learned; (ii) neural networks can represent (and hence learn) intricate hierarchical features suited to the data. In this paper we argue, based on a novel viewpoint, that neither interpretation is likely to be correct. Neural networks can be viewed as a mixture of experts, where each expert corresponds to a path through a sequence of hidden units, one per layer. We use this alternate interpretation to motivate a model, called the Deep Linearly Gated Network (DLGN), which sits midway between deep linear networks and ReLU networks. Unlike deep linear networks, the DLGN is capable of learning non-linear features (which are then linearly combined), and unlike ReLU networks these features are ultimately simple: each feature is effectively the indicator function of a region compactly described as an intersection of half-spaces (one per layer) in the input space. This viewpoint allows for a comprehensive global visualization of features, unlike the local, per-neuron visualizations given by saliency, activation, or gradient maps. We show that feature learning does occur in DLGNs, and that its mechanism is the learning of half-spaces in the input space that contain smooth regions of the target function. Owing to the structure of DLGNs, neurons in later layers are fundamentally the same as those in earlier layers, in that each represents a half-space; however, the dynamics of gradient descent impart a distinct clustering to the later-layer neurons. We hypothesize that ReLU networks exhibit similar feature-learning behaviour.
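The abstract describes the DLGN only at a high level, so the following PyTorch sketch shows one way the two-path architecture could be realized: a purely linear gating path whose (soft-)thresholded pre-activations gate a separate linear value path. The sigmoid gate with temperature `beta`, the layer widths, and all identifiers are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class DLGN(nn.Module):
    """Sketch of a Deep Linearly Gated Network (DLGN).

    The gating path is purely linear, so each gate pre-activation is a
    linear function of the input x; thresholding it therefore carves out
    a half-space in input space. The gates multiplicatively switch the
    units of a separate linear value path, so each path through the
    network is active exactly on an intersection of half-spaces.
    """

    def __init__(self, in_dim: int, width: int, depth: int, beta: float = 10.0):
        super().__init__()
        dims = [in_dim] + [width] * depth
        self.gate_layers = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(depth))
        self.value_layers = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(depth))
        self.head = nn.Linear(width, 1)
        self.beta = beta  # gate temperature; beta -> inf gives hard 0/1 gates

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g, v = x, x
        for gate, value in zip(self.gate_layers, self.value_layers):
            g = gate(g)                                   # stays linear in x
            v = value(v) * torch.sigmoid(self.beta * g)  # gated value path
        return self.head(v)

net = DLGN(in_dim=2, width=16, depth=4)
out = net(torch.randn(8, 2))  # shape (8, 1)
```

Because the gating path contains no nonlinearity, the composition of its linear layers remains linear in x, so each unit's on/off decision is a half-space test; the product of gates along a path is then the indicator of an intersection of as many half-spaces as there are layers, matching the abstract's description of DLGN features.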
URL
https://arxiv.org/abs/2404.04312