Abstract
Adaptive stochastic gradient methods such as ADAGRAD have gained popularity, in particular for training deep neural networks. The most commonly used and studied variant maintains a diagonal matrix approximation to second-order information by accumulating past gradients, which are used to tune the step size adaptively. In certain situations the full-matrix variant of ADAGRAD is expected to attain better performance; however, in high dimensions it is computationally impractical. We present ADA-LR and RADAGRAD, two computationally efficient approximations to full-matrix ADAGRAD based on randomized dimensionality reduction. They are able to capture dependencies between features and achieve similar performance to full-matrix ADAGRAD, but at a much smaller computational cost. We show that the regret of ADA-LR is close to the regret of full-matrix ADAGRAD, which can have an up to exponentially smaller dependence on the dimension than the diagonal variant. Empirically, we show that ADA-LR and RADAGRAD perform similarly to full-matrix ADAGRAD. On the task of training convolutional as well as recurrent neural networks, RADAGRAD achieves faster convergence than diagonal ADAGRAD.
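For reference, a minimal sketch of the two updates being contrasted, in the standard ADAGRAD formulation (the notation below, with step size $\eta$, gradient $g_t$ at step $t$, and accumulator $G_t$, is assumed here and not defined in this abstract; in practice a small damping term is added to keep the matrices invertible):

\[
G_t = \sum_{s=1}^{t} g_s g_s^\top, \qquad
x_{t+1} = x_t - \eta\, G_t^{-1/2} g_t \;\;\text{(full-matrix)}, \qquad
x_{t+1} = x_t - \eta\, \operatorname{diag}(G_t)^{-1/2} g_t \;\;\text{(diagonal)}.
\]

The diagonal update costs $O(d)$ per step but ignores correlations between features, while the full-matrix update captures them at $O(d^2)$ memory and a superlinear per-step cost, which is the gap the proposed randomized approximations are meant to close.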