Safe and efficient off-policy reinforcement learning

Abstract

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q(λ), which was an open problem since 1989. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games.
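For context, Retrace(λ) corrects return-based targets computed from behaviour-policy data by truncating the per-step importance weights, c_s = λ · min(1, π(a_s|x_s) / μ(a_s|x_s)), as defined in the paper. The sketch below is an illustrative reconstruction of that target computation for a single trajectory, not code from this page or from the authors; all names (retrace_targets, pi_dist, etc.) are chosen here for illustration.

import numpy as np

def retrace_targets(q_values, actions, rewards, mu_probs, pi_probs, pi_dist,
                    gamma=0.99, lam=1.0):
    # Sketch of Retrace(lambda) target computation for one trajectory
    # x_0, a_0, r_0, ..., x_{T-1}, a_{T-1}, r_{T-1}, x_T generated by mu.
    #
    # q_values : (T+1, A) current Q estimates at the visited states
    #            (pass a zero row for q_values[T] if x_T is terminal)
    # actions  : (T,) int actions taken by the behaviour policy mu
    # rewards  : (T,)     observed rewards
    # mu_probs : (T,)     mu(a_t | x_t) for the taken actions
    # pi_probs : (T,)     pi(a_t | x_t) for the taken actions
    # pi_dist  : (T+1, A) full target-policy distribution pi(. | x_t)
    # Returns  : (T,)     corrected targets for Q(x_t, a_t)
    T = len(rewards)

    # Truncated importance weights c_s = lambda * min(1, pi/mu): cutting the
    # traces at 1 keeps variance low while remaining safe off-policy.
    c = lam * np.minimum(1.0, pi_probs / mu_probs)

    # Expected bootstrap value under the target policy, E_pi[Q(x_t, .)].
    v_pi = np.sum(pi_dist * q_values, axis=1)

    q_taken = q_values[np.arange(T), actions]
    # One-step TD errors with the expected bootstrap under pi.
    delta = rewards + gamma * v_pi[1:] - q_taken

    # Backward recursion for the correction term:
    #   Delta_t = delta_t + gamma * c_{t+1} * Delta_{t+1}.
    correction = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        next_c = c[t + 1] if t + 1 < T else 0.0  # no trace past the last step
        acc = delta[t] + gamma * next_c * acc
        correction[t] = acc

    return q_taken + correction

With on-policy data (pi equal to mu) the truncation is inactive and the traces reduce to plain λ-weighted eligibility traces; off-policy, the min(1, ·) cut is what bounds the variance that full importance sampling would otherwise incur.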
