Nostalgic Adam: Weighting More of the Past Gradients When Designing the
Adaptive Learning Rate
Abstract
First-order optimization algorithms play a prominent role in deep learning; algorithms such as RMSProp and Adam, in particular, are extremely popular. However, recent works have pointed out the lack of “long-term memory” in Adam-like algorithms, which could hamper their performance and lead to divergence. In our study, we observe that there are benefits to placing more weight on past gradients when designing the adaptive learning rate. We therefore propose an algorithm called Nostalgic Adam (NosAdam), with theoretically guaranteed convergence at the best known convergence rate. NosAdam can be regarded as a fix to the non-convergence issue of Adam, as an alternative to the recent work of [Reddi et al., 2018]. Our preliminary numerical experiments show that NosAdam is a promising alternative to Adam. The proofs, code, and other supplementary materials have already been released.
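To make the core idea concrete, below is a minimal sketch (in NumPy) of an adaptive update in this spirit: the first moment is an exponential moving average as in Adam, while the second moment averages all past squared gradients with slowly (polynomially) decaying weights, so that older gradients retain more influence than under Adam's exponential decay. The specific weighting b_k = k^(-gamma), the function name, and the omission of bias correction are illustrative assumptions for this sketch, not details taken from the abstract.

import numpy as np

def nostalgic_adam_like_step(params, grads, state, lr=0.001, beta1=0.9,
                             gamma=0.1, eps=1e-8):
    """One illustrative update step (hypothetical sketch, not the paper's exact rule).

    The first moment m is an exponential moving average, as in Adam.
    The second moment v is a weighted average of all past squared gradients
    with weights b_k = k^(-gamma); because these weights decay only
    polynomially, past gradients keep more influence than under Adam's
    exponentially decaying average ("long-term memory").
    """
    m, v, t, B = state["m"], state["v"], state["t"], state["B"]
    t += 1
    b_t = t ** (-gamma)          # weight assigned to the newest squared gradient
    B_new = B + b_t              # running sum of weights B_t = sum_k b_k

    m = beta1 * m + (1.0 - beta1) * grads
    # v_t = (B_{t-1}/B_t) * v_{t-1} + (b_t/B_t) * g_t^2  (a convex combination)
    v = (B / B_new) * v + (b_t / B_new) * grads ** 2

    params = params - lr * m / (np.sqrt(v) + eps)
    state.update(m=m, v=v, t=t, B=B_new)
    return params, state

# Usage on a toy quadratic objective f(x) = ||x||^2 / 2, whose gradient is x.
x = np.array([1.0, -2.0])
state = dict(m=np.zeros_like(x), v=np.zeros_like(x), t=0, B=0.0)
for _ in range(1000):
    x, state = nostalgic_adam_like_step(x, x, state, lr=0.01)
print(x)  # should end up close to the minimizer [0, 0]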