A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks
Distributed synchronous stochastic gradient descent (S-SGD) with data parallelism has been widely used in training large-scale deep neural networks (DNNs), but it typically requires very high communication bandwidth between computational workers (e.g., GPUs) to exchange gradients iteratively. Recently, Top-k sparsification techniques have been proposed to reduce the volume of data …
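To make the Top-k idea concrete, below is a minimal sketch of gradient sparsification, assuming PyTorch tensors; the function name `topk_sparsify` and the `ratio` parameter are illustrative, not taken from the paper. Each worker keeps only the k largest-magnitude entries of its gradient and communicates the (value, index) pairs instead of the dense tensor.

```python
# A minimal sketch of Top-k gradient sparsification (illustrative,
# not the paper's implementation).
import torch

def topk_sparsify(grad: torch.Tensor, ratio: float = 0.001):
    """Keep only the k largest-magnitude gradient entries.

    Returns the selected values and their flat indices, which is what a
    worker would exchange instead of the dense gradient.
    """
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    # Indices of the k entries with the largest absolute value.
    _, indices = torch.topk(flat.abs(), k)
    values = flat[indices]
    return values, indices

# Example: sparsify a dense gradient down to 0.1% of its entries.
g = torch.randn(10_000)
vals, idx = topk_sparsify(g, ratio=0.001)
assert vals.numel() == 10  # 0.1% of 10,000 entries
```

With a ratio of 0.001, each worker sends roughly 0.1% of the original gradient volume (plus the indices), which is the bandwidth saving that motivates sparsified S-SGD.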