
ESCAPING SADDLE POINTS FASTER WITH STOCHASTIC MOMENTUM

2020-01-02

Abstract

Stochastic gradient descent (SGD) with stochastic momentum is popular in non-convex stochastic optimization, and particularly for the training of deep neural networks. In standard SGD, parameters are updated along the gradient at the current iterate computed on a batch of examples; the addition of a "momentum" term biases the update in the direction of the previous change in parameters. In non-stochastic convex optimization one can show that a momentum adjustment provably reduces convergence time in many settings, yet such results have been elusive in the stochastic and non-convex settings. At the same time, a widely observed empirical phenomenon is that stochastic momentum appears to significantly improve convergence time when training deep networks, and variants of it have flourished in the development of other popular update methods, e.g. ADAM (Kingma & Ba (2015)) and AMSGrad (Reddi et al. (2018b)). Yet theoretical justification for the use of stochastic momentum has remained a significant open question. In this paper we propose an answer: stochastic momentum improves deep network training because it modifies SGD to escape saddle points faster and, consequently, to find a second-order stationary point more quickly. Our theoretical results also shed light on the related question of how to choose the ideal momentum parameter: our analysis suggests that the momentum parameter β should be large (close to 1), which comports with empirical findings. We also provide experimental findings that further validate these conclusions.
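For concreteness, here is a minimal NumPy sketch of the heavy-ball style update the abstract describes: a stochastic gradient step plus a momentum term proportional to the previous change in parameters. The step size eta, the momentum parameter beta, and the toy noisy-quadratic objective are illustrative placeholders, not values or code from the paper.

```python
import numpy as np

def sgd_with_momentum(x0, stochastic_grad, eta=0.01, beta=0.9, steps=1000):
    """Heavy-ball SGD sketch: each update follows the stochastic gradient,
    plus a momentum term beta * (x_t - x_{t-1}) that biases the step toward
    the previous change in parameters."""
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(steps):
        g = stochastic_grad(x)                       # noisy gradient on a sampled batch
        x_next = x - eta * g + beta * (x - x_prev)   # gradient step + momentum
        x_prev, x = x, x_next
    return x

# Illustrative usage on a toy quadratic with gradient noise (not from the paper).
rng = np.random.default_rng(0)
noisy_grad = lambda x: 2.0 * x + 0.1 * rng.standard_normal(x.shape)
x_final = sgd_with_momentum(np.ones(5), noisy_grad, eta=0.01, beta=0.9)
```

Note that setting beta close to 1, as the paper's analysis suggests, makes the update retain more of the previous step's direction.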


