Abstract
Understanding the optimization trajectory is key to understanding the training of deep neural networks. We show how the hyperparameters of stochastic gradient descent (SGD) influence the covariance of the gradients (K) and the Hessian of the training loss (H) along the trajectory. Based on a theoretical model, we conjecture that using a high learning rate or a small batch size leads SGD to regions of the loss landscape, typically early during training, characterized by (1) reduced spectral norm of K, and (2) improved conditioning of K and H. We refer to the point on the training trajectory after which these effects hold as the break-even point. We demonstrate these effects empirically for a range of deep neural networks applied to different tasks. Finally, we apply our analysis to networks with batch normalization (BN) layers and find that improving the conditioning of the loss surface requires a higher learning rate than in a network without BN layers.
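To make the quantities in the abstract concrete, the following is a minimal, self-contained sketch (not the paper's code) of how one might estimate K, the covariance of per-example gradients, and H, the Hessian of the training loss, and report their spectral norms and condition numbers. It uses logistic regression on synthetic data purely as a stand-in for a deep network; all variable names and the toy setup are illustrative assumptions.

```python
# Illustrative sketch: estimating K (gradient covariance) and H (loss Hessian)
# for logistic regression, then reporting spectral norm and conditioning.
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 10
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)
w = rng.normal(scale=0.1, size=d)          # current parameter vector

p = 1.0 / (1.0 + np.exp(-X @ w))           # predicted probabilities
per_example_grads = (p - y)[:, None] * X   # gradient of each example's log loss
mean_grad = per_example_grads.mean(axis=0)

# K: covariance of per-example gradients around the full-batch gradient
centered = per_example_grads - mean_grad
K = centered.T @ centered / n

# H: exact Hessian of the mean log loss for logistic regression
H = (X * (p * (1 - p))[:, None]).T @ X / n

def spectrum_stats(M, eps=1e-12):
    """Return the spectral norm and condition number of a symmetric matrix."""
    eigvals = np.linalg.eigvalsh(M)        # ascending order
    lam_max, lam_min = eigvals[-1], max(eigvals[0], eps)
    return lam_max, lam_max / lam_min

print("||K||_2, cond(K):", spectrum_stats(K))
print("||H||_2, cond(H):", spectrum_stats(H))
```

Tracking these two statistics along an SGD trajectory, under different learning rates and batch sizes, is the kind of measurement the abstract refers to when it describes the break-even point.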