DON'T USE LARGE MINI-BATCHES, USE LOCAL SGD

2020-01-02

Abstract

Mini-batch stochastic gradient methods (SGD) are the state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e., they do not show good accuracy on new data. As a remedy, we propose post-local SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of local SGD variants.
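The mechanism behind (post-)local SGD is that each worker takes several SGD steps on its own data shard before the models are averaged; in the post-local variant, this only begins after an initial phase of standard synchronous mini-batch SGD. The sketch below illustrates that schedule on a toy least-squares problem with serially simulated workers. It is a minimal illustration under assumed names (`post_local_sgd`, `local_steps`, `switch_round`), not the authors' released implementation.

```python
# Minimal sketch of (post-)local SGD on a toy least-squares problem.
# Assumption: K workers are simulated serially; this is not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: linear regression with K disjoint shards, one per worker.
K, n_per_worker, d = 4, 256, 10
w_true = rng.normal(size=d)
X = [rng.normal(size=(n_per_worker, d)) for _ in range(K)]
y = [X[k] @ w_true + 0.1 * rng.normal(size=n_per_worker) for k in range(K)]

def sgd_step(w, Xk, yk, lr, batch=8):
    """One mini-batch SGD step on a worker's shard (squared loss)."""
    idx = rng.integers(0, Xk.shape[0], size=batch)
    grad = Xk[idx].T @ (Xk[idx] @ w - yk[idx]) / batch
    return w - lr * grad

def post_local_sgd(rounds=200, local_steps=8, switch_round=100, lr=0.05):
    """Phase 1: synchronous mini-batch SGD (average after every step).
    Phase 2 (after `switch_round`): local SGD, i.e. each worker takes
    `local_steps` steps before the models are averaged."""
    w = np.zeros(d)                      # shared model
    for t in range(rounds):
        H = 1 if t < switch_round else local_steps
        local = []
        for k in range(K):               # simulate the K workers serially
            wk = w.copy()
            for _ in range(H):
                wk = sgd_step(wk, X[k], y[k], lr)
            local.append(wk)
        w = np.mean(local, axis=0)       # communication: average the models
    return w

w_hat = post_local_sgd()
print("parameter error:", np.linalg.norm(w_hat - w_true))
```

In the first phase every step is followed by an average, which is equivalent to large-batch synchronous SGD; after `switch_round` the averaging happens only every `local_steps` steps, which reduces communication and, per the paper's claim, improves generalization over pure large-batch training.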
