A Convergence Analysis of Distributed SGD with Communication-Efficient
Gradient Sparsification
Abstract
Gradient sparsification is a promising technique to significantly reduce the communication overhead in distributed synchronous stochastic gradient descent (S-SGD) algorithms. Yet, many existing gradient sparsification schemes (e.g., Top-k sparsification) have a communication complexity of O(kP), where k is the number of gradients selected by each worker and P is the number of workers. Recently, the gTop-k sparsification scheme has been proposed to reduce the communication complexity from O(kP) to O(k log P), which significantly boosts system scalability. However, it remains unclear whether the gTop-k sparsification scheme converges in theory. In this paper, we first provide theoretical proofs of the convergence of the gTop-k scheme for non-convex objective functions under certain analytic assumptions. We then derive the convergence rate of gTop-k S-SGD, which is of the same order as that of vanilla mini-batch SGD. Finally, we conduct extensive experiments on different machine learning models and data sets to verify the soundness of the assumptions and theoretical results, and discuss the impact of the compression ratio on convergence performance.