Abstract
Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. signSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. The relative ℓ1/ℓ2 geometry of gradients, noise and curvature informs whether signSGD or SGD is theoretically better suited to a particular problem. On the practical side, we find that the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep ImageNet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker, enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss (1823), we prove that majority vote can achieve the same reduction in variance as full-precision distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve fast communication and fast convergence. Code to reproduce experiments is available at https://github.com/jxbz/signSGD.
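To make the majority-vote scheme concrete, the following is a minimal sketch in plain Python, not the released implementation. The function names (`sign`, `majority_vote`, `majority_vote_step`, `run_toy_demo`), the toy quadratic objective, the Gaussian noise model, the step size and the number of workers are all assumptions chosen for illustration. Each worker sends only the sign of its stochastic gradient, and the server sends back the elementwise sign of the summed votes.

```python
import random


def sign(x):
    """Return +1, -1 or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def majority_vote(worker_signs):
    """Server side: elementwise sign of the sum of the workers' sign vectors.

    A tied vote (possible with an even number of workers) yields 0, i.e. no
    update for that coordinate; this tie rule is an assumption of the sketch.
    """
    return [sign(sum(column)) for column in zip(*worker_signs)]


def majority_vote_step(params, worker_grads, lr):
    """One round: workers transmit sign(gradient), server returns the vote."""
    worker_signs = [[sign(g) for g in grad] for grad in worker_grads]
    vote = majority_vote(worker_signs)
    return [p - lr * v for p, v in zip(params, vote)]


def run_toy_demo(seed=0, workers=7, steps=200, lr=0.05):
    """Hypothetical toy problem: f(x) = 0.5 * ||x||^2 with noisy gradients."""
    rng = random.Random(seed)
    params = [5.0, -3.0, 2.0]
    for _ in range(steps):
        # Each worker sees the true gradient (equal to params) plus unit
        # Gaussian noise.
        grads = [[p + rng.gauss(0.0, 1.0) for p in params]
                 for _ in range(workers)]
        params = majority_vote_step(params, grads, lr)
    return params


if __name__ == "__main__":
    print(run_toy_demo())
```

An odd number of workers is used in the demo so that no vote can tie when every gradient component is nonzero. Every update has magnitude `lr` per coordinate regardless of gradient scale, which is why the iterates hover around the optimum rather than settling exactly on it when the step size is fixed.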