Safe and efficient off-policy reinforcement learning

Abstract

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace(λ), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to Q* without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q(λ), which was an open problem since 1989. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games.
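For context, Retrace(λ) corrects return-based targets computed from behaviour-policy data by truncating the per-step importance weights, c_s = λ · min(1, π(a_s|x_s) / μ(a_s|x_s)), as defined in the paper. The sketch below is an illustrative reconstruction of that target computation for a single trajectory, not code from this page or from the authors; all names (retrace_targets, pi_dist, etc.) are chosen here for illustration.

import numpy as np

def retrace_targets(q_values, actions, rewards, mu_probs, pi_probs, pi_dist,
                    gamma=0.99, lam=1.0):
    # Sketch of Retrace(lambda) target computation for one trajectory
    # x_0, a_0, r_0, ..., x_{T-1}, a_{T-1}, r_{T-1}, x_T generated by mu.
    #
    # q_values : (T+1, A) current Q estimates at the visited states
    #            (pass a zero row for q_values[T] if x_T is terminal)
    # actions  : (T,) int actions taken by the behaviour policy mu
    # rewards  : (T,)     observed rewards
    # mu_probs : (T,)     mu(a_t | x_t) for the taken actions
    # pi_probs : (T,)     pi(a_t | x_t) for the taken actions
    # pi_dist  : (T+1, A) full target-policy distribution pi(. | x_t)
    # Returns  : (T,)     corrected targets for Q(x_t, a_t)
    T = len(rewards)

    # Truncated importance weights c_s = lambda * min(1, pi/mu): cutting the
    # traces at 1 keeps variance low while remaining safe off-policy.
    c = lam * np.minimum(1.0, pi_probs / mu_probs)

    # Expected bootstrap value under the target policy, E_pi[Q(x_t, .)].
    v_pi = np.sum(pi_dist * q_values, axis=1)

    q_taken = q_values[np.arange(T), actions]
    # One-step TD errors with the expected bootstrap under pi.
    delta = rewards + gamma * v_pi[1:] - q_taken

    # Backward recursion for the correction term:
    #   Delta_t = delta_t + gamma * c_{t+1} * Delta_{t+1}.
    correction = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        next_c = c[t + 1] if t + 1 < T else 0.0  # no trace past the last step
        acc = delta[t] + gamma * next_c * acc
        correction[t] = acc

    return q_taken + correction

With on-policy data (pi equal to mu) the truncation is inactive and the traces reduce to plain λ-weighted eligibility traces; off-policy, the min(1, ·) cut is what bounds the variance that full importance sampling would otherwise incur.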
