Abstract
Variance reduction methods which use a mixture of large and small batch gradients, such as SVRG (Johnson & Zhang, 2013) and SpiderBoost (Wang et al., 2018), require significantly more computational resources per update than SGD (Robbins & Monro, 1951). We reduce the computational cost per update of variance reduction methods by introducing a sparse gradient operator blending the top-K operator (Stich et al., 2018; Aji & Heafield, 2017) and the randomized coordinate descent operator. While the computational cost of computing the derivative of a model parameter is constant, we make the observation that the gains in variance reduction are proportional to the magnitude of the derivative. In this paper, we show that a sparse gradient based on the magnitude of past gradients reduces the computational cost of model updates without a significant loss in variance reduction. Theoretically, our algorithm is at least as good as the best available algorithm (e.g. SpiderBoost) under appropriate settings of parameters and can be much more efficient if our algorithm succeeds in capturing the sparsity of the gradients. Empirically, our algorithm consistently outperforms SpiderBoost using various models to solve various image classification tasks. We also provide empirical evidence to support the intuition behind our algorithm via a simple gradient entropy computation, which serves to quantify gradient sparsity at every iteration.