Communication-efficient distributed SGD with Sketching

2019-03-12 17:59:48
Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Vladimir Braverman, Ion Stoica, Raman Arora

Abstract

Large-scale distributed training of neural networks is often limited by network bandwidth, wherein the communication time overwhelms the local computation time. Motivated by the success of sketching methods in sub-linear/streaming algorithms, we propose a sketching-based approach to minimize the communication costs between nodes without losing accuracy. In our proposed method, workers in a distributed, synchronous training setting send sketches of their gradient vectors to the parameter server instead of the full gradient vector. Leveraging the theoretical properties of sketches, we show that this method recovers the favorable convergence guarantees of single-machine top-$k$ SGD. Furthermore, when applied to a model with $d$ dimensions on $W$ workers, our method requires only $\Theta(kW)$ bytes of communication, compared to $\Omega(dW)$ for vanilla distributed SGD. To validate our method, we run experiments using a residual network trained on the CIFAR-10 dataset. We achieve no drop in validation accuracy with a compression ratio of 4, or about 1 percentage point drop with a compression ratio of 8. We also demonstrate that our method scales to many workers.
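
The core idea described above (each worker sends a small Count Sketch of its gradient instead of the full d-dimensional vector; the server merges the sketches, which are linear, and recovers the heavy top-k coordinates) can be illustrated with a toy NumPy example. This is a minimal illustrative sketch, not the paper's implementation: the table sizes, hash construction, and function names below are assumptions chosen for clarity, and a real implementation would derive hashes from seeded hash functions rather than storing dense index tables.

    import numpy as np

    def count_sketch(grad, rows=5, cols=256, seed=0):
        """Compress a gradient vector into a rows x cols Count Sketch table."""
        rng = np.random.default_rng(seed)
        d = grad.size
        # Hash each coordinate to one bucket per row, with a random sign.
        # (Stored as dense arrays here for clarity; a real implementation
        # would recompute these from shared hash functions.)
        buckets = rng.integers(0, cols, size=(rows, d))
        signs = rng.choice([-1.0, 1.0], size=(rows, d))
        table = np.zeros((rows, cols))
        for r in range(rows):
            np.add.at(table[r], buckets[r], signs[r] * grad)
        return table, buckets, signs

    def estimate(table, buckets, signs, i):
        """Median-of-rows estimate of coordinate i, used to find heavy coordinates."""
        rows = table.shape[0]
        return np.median([signs[r, i] * table[r, buckets[r, i]] for r in range(rows)])

    # Toy usage: one worker sketches its gradient; a parameter server would sum
    # the workers' tables (sketching is linear) and query the heavy coordinates.
    g = np.zeros(10_000)
    g[[3, 42, 777]] = [5.0, -4.0, 3.5]           # a few large coordinates
    table, b, s = count_sketch(g)
    print(round(estimate(table, b, s, 42), 2))    # close to -4.0

The communication saving comes from the table being rows x cols floats per worker, independent of d, while summing the workers' tables gives a sketch of the aggregate gradient from which the largest coordinates can be estimated.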

URL

https://arxiv.org/abs/1903.04488

PDF

https://arxiv.org/pdf/1903.04488.pdf

