The Fixed Points of Off-Policy TD

2020-01-08

Abstract

Off-policy learning, the ability for an agent to learn about a policy other than the one it is following, is a key element of Reinforcement Learning, and in recent years there has been much work on developing Temporal Difference (TD) algorithms that are guaranteed to converge under off-policy sampling. It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. In general, the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees on the solution quality similar to the on-policy case. Furthermore, we show that we can efficiently project onto this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods.
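The abstract's central claim, that the off-policy TD fixed point can be arbitrarily poor even when the features represent the value function well, is easy to see in the linear setting. Below is a minimal sketch, not taken from the paper: the two-state MDP, the feature matrix, and the on-/off-policy state weightings are all hypothetical numbers chosen for illustration. It computes the linear TD(0) fixed point under each weighting and compares the resulting value errors.

# Minimal sketch (hypothetical small MDP, linear features): the linear TD(0)
# fixed point under a state-weighting distribution d solves  A theta = b  with
#   A = Phi^T D (I - gamma * P_pi) Phi,   b = Phi^T D r_pi,   D = diag(d).
# Under the on-policy stationary distribution the solution error is bounded;
# for an arbitrary off-policy weighting it need not be.
import numpy as np

def td_fixed_point(P_pi, r_pi, Phi, d, gamma):
    # Solve the linear TD(0) fixed-point equations under weighting d.
    D = np.diag(d)
    A = Phi.T @ D @ (np.eye(len(d)) - gamma * P_pi) @ Phi
    b = Phi.T @ D @ r_pi
    return np.linalg.solve(A, b)

# Tiny two-state example (hypothetical numbers).
P_pi = np.array([[0.5, 0.5],
                 [0.5, 0.5]])     # target-policy transition matrix
r_pi = np.array([0.0, 1.0])       # expected rewards under the target policy
Phi  = np.array([[1.0], [1.05]])  # one shared feature (nearly aliased states)
gamma = 0.95

d_on  = np.array([0.5, 0.5])      # on-policy stationary distribution
d_off = np.array([0.99, 0.01])    # an arbitrary off-policy weighting

v_true = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
for name, d in [("on-policy", d_on), ("off-policy", d_off)]:
    theta = td_fixed_point(P_pi, r_pi, Phi, d, gamma)
    err = np.max(np.abs(Phi @ theta - v_true))
    print(f"{name:10s} fixed point: theta={theta}, max value error={err:.3f}")

With these hypothetical numbers, the on-policy fixed point approximates the true values closely, while the heavily skewed off-policy weighting produces a fixed point whose value error is larger by more than an order of magnitude, even though the single feature can fit the true value function well. This is the failure mode for which the paper's restriction to a convex subset of off-policy distributions provides guarantees.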
