DiCE: The Infinitely Differentiable Monte Carlo Estimator

2020-03-16

Abstract

The score function estimator is widely used for estimating gradients of stochastic objectives in stochastic computation graphs (SCGs), e.g., in reinforcement learning and meta-learning. While deriving first-order gradient estimators by differentiating a surrogate loss (SL) objective is computationally and conceptually simple, using the same approach for higher-order derivatives is more challenging. Firstly, analytically deriving and implementing such estimators is laborious and not compliant with automatic differentiation. Secondly, repeatedly applying SL to construct new objectives for each order of derivative involves increasingly cumbersome graph manipulations. Lastly, to match the first-order gradient under differentiation, SL treats part of the cost as a fixed sample, which we show leads to missing and wrong terms in estimators of higher-order derivatives. To address all these shortcomings in a unified way, we introduce DiCE, which provides a single objective that can be differentiated repeatedly, generating correct estimators of derivatives of any order in SCGs. Unlike SL, DiCE relies on automatic differentiation for performing the requisite graph manipulations. We verify the correctness of DiCE both through a proof and through numerical evaluation of the DiCE derivative estimates. We also use DiCE to propose and evaluate a novel approach for multi-agent learning. Our code is available at github.com/alshedivat/lola.
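The contrast between SL and DiCE under repeated differentiation can be illustrated on a toy problem. The sketch below is not taken from the authors' repository. It assumes the operator the DiCE paper calls MagicBox, exp(tau - stop_gradient(tau)), where tau is the summed log-probability of the sampled variables. The `Jet` class is a hypothetical stand-in for an automatic differentiation library that tracks a value with its first and second derivative with respect to a single scalar theta, and the toy cost x1 * x2 with two independent Bernoulli(theta) samples is an assumption chosen so that the true objective is theta^2. Expectations are computed by exact enumeration rather than sampling, so the comparison is free of Monte Carlo noise.

```python
import math


class Jet:
    """Value plus first and second derivative w.r.t. one scalar theta.

    Hypothetical helper standing in for an autodiff library.
    """

    def __init__(self, v, d1=0.0, d2=0.0):
        self.v, self.d1, self.d2 = v, d1, d2

    @staticmethod
    def lift(x):
        # Treat plain numbers as constants (zero derivatives).
        return x if isinstance(x, Jet) else Jet(float(x))

    def __add__(self, other):
        o = Jet.lift(other)
        return Jet(self.v + o.v, self.d1 + o.d1, self.d2 + o.d2)

    __radd__ = __add__

    def __sub__(self, other):
        o = Jet.lift(other)
        return Jet(self.v - o.v, self.d1 - o.d1, self.d2 - o.d2)

    def __rsub__(self, other):
        return Jet.lift(other) - self

    def __mul__(self, other):
        o = Jet.lift(other)
        # Product rule up to second order.
        return Jet(
            self.v * o.v,
            self.d1 * o.v + self.v * o.d1,
            self.d2 * o.v + 2.0 * self.d1 * o.d1 + self.v * o.d2,
        )

    __rmul__ = __mul__


def jet_exp(x):
    e = math.exp(x.v)
    return Jet(e, e * x.d1, e * (x.d2 + x.d1 ** 2))


def jet_log(x):
    r = x.d1 / x.v
    return Jet(math.log(x.v), r, x.d2 / x.v - r ** 2)


def stop_gradient(x):
    # Keep the value, drop all derivative information.
    return Jet(x.v)


def magic_box(tau):
    # Evaluates to 1, but its derivatives carry the score-function terms.
    return jet_exp(tau - stop_gradient(tau))


def log_prob(x, theta):
    # Log-probability of a Bernoulli(theta) outcome x in {0, 1}.
    return jet_log(theta) if x == 1 else jet_log(1.0 - theta)


def dice_objective(tau, cost):
    return magic_box(tau) * cost


def surrogate_loss(tau, cost):
    # First-order surrogate loss, here naively differentiated twice.
    return tau * cost


def expected_derivatives(objective, theta_value):
    """Exact expectation of the objective's 1st and 2nd derivatives."""
    theta = Jet(theta_value, 1.0, 0.0)
    total_d1 = total_d2 = 0.0
    for x1 in (0, 1):
        for x2 in (0, 1):
            tau = log_prob(x1, theta) + log_prob(x2, theta)
            weight = math.exp(tau.v)  # probability of this outcome
            obj = objective(tau, float(x1 * x2))
            total_d1 += weight * obj.d1
            total_d2 += weight * obj.d2
    return total_d1, total_d2


if __name__ == "__main__":
    print("DiCE:", expected_derivatives(dice_objective, 0.3))
    print("SL:  ", expected_derivatives(surrogate_loss, 0.3))
```

At theta = 0.3 the true derivatives of theta^2 are 0.6 and 2.0. Under these assumptions both objectives recover the first derivative, but only the DiCE objective recovers the second: differentiating the surrogate loss a second time yields about -2.0, because the cost-weighted log-probability omits the squared score term that MagicBox supplies.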

