Abstract
Deep learning architectures are usually proposed with millions of parameters, which leads to memory issues when training deep neural networks with stochastic gradient descent (SGD) type methods using large batch sizes. However, training with small batch sizes tends to produce low-quality solutions due to the large variance of stochastic gradients. In
this paper, we tackle this problem by proposing a new framework for training deep neural networks with small batches and noisy gradients. During optimization, our method iteratively applies a proximal-type regularizer to make the loss function strongly convex. This regularizer stabilizes the gradients, leading to better training performance. We prove that our algorithm achieves a convergence rate comparable to that of vanilla SGD even with small batch sizes. Our framework is simple to implement and can potentially be combined with many existing optimization algorithms. Empirical results show that our method outperforms SGD and Adam when the batch size is small. Our implementation is available at
https://github.com/huiqu18/TRAlgorithm
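
The abstract only outlines the mechanism at a high level. As an illustration of the proximal-point idea it describes, the sketch below shows one possible outer iteration in PyTorch: the mini-batch loss is augmented with a quadratic term (mu/2)||w - w_k||^2 centered at the current anchor w_k, which makes the inner objective strongly convex and damps the noise of small-batch gradients. The function name, hyperparameters (mu, lr, inner_steps), and loop structure are illustrative assumptions, not the authors' reference implementation (see the linked repository for that).

```python
# Illustrative sketch (not the authors' reference implementation): one outer
# iteration of a proximal-point style SGD update.  Around the anchor w_k the
# inner objective  f(w) + (mu/2) * ||w - w_k||^2  is locally strongly convex,
# which stabilizes small-batch stochastic gradients.
import itertools
import torch

def proximal_sgd_outer_step(model, loss_fn, data_loader,
                            mu=0.1, lr=0.01, inner_steps=10):
    # Freeze a copy of the current weights to serve as the proximal anchor w_k.
    anchor = [p.detach().clone() for p in model.parameters()]
    batches = itertools.cycle(data_loader)
    for _ in range(inner_steps):
        inputs, targets = next(batches)
        loss = loss_fn(model(inputs), targets)
        # Quadratic proximal term (mu/2) * ||w - w_k||^2 added to the loss.
        prox = sum(((p - a) ** 2).sum()
                   for p, a in zip(model.parameters(), anchor))
        total = loss + 0.5 * mu * prox
        model.zero_grad()
        total.backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad
    # The final inner iterate becomes the next anchor w_{k+1}.
```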