Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

2020-02-26


Abstract

Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) — the problem of evaluating a new policy using the historical data obtained by different behavior policies — under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon H. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of

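$$
\frac{1}{n}\sum_{t=1}^{H}\mathbb{E}_{\mu}\left[\frac{d_t^{\pi}(s_t)^2}{d_t^{\mu}(s_t)^2}\,\mathrm{Var}_{\mu}\left[\frac{\pi_t(a_t\mid s_t)}{\mu_t(a_t\mid s_t)}\big(V_{t+1}^{\pi}(s_{t+1})+r_t\big)\,\middle|\,s_t\right]\right]+\tilde{O}(n^{-1.5}),
$$

where µ and π are the logging and target policies, $d_t^{\mu}(s_t)$ and $d_t^{\pi}(s_t)$ are the marginal distributions of the state at the t-th step, H is the horizon, n is the sample size, and $V_{t+1}^{\pi}$ is the value function of the MDP under π. The result matches the Cramer-Rao lower bound in Jiang and Li [2016] up to a multiplicative factor of H. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on H. Besides theory, we show empirical superiority of our method in time-varying, partially observable, and long-horizon RL environments.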
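The recursion described in the abstract can be illustrated with a minimal tabular sketch. It is not the authors' code. It assumes a small finite state and action space, known logging and target policies stored as arrays of shape (H, S, A), and episodes stored as integer arrays. The function name `mis_estimate` and the toy MDP used to generate data are hypothetical and chosen only for illustration. At each step the sketch forms an importance-weighted estimate of the one-step transition and reward per state, then propagates the estimated state marginal of the target policy forward.

```python
import numpy as np

def mis_estimate(states, actions, rewards, pi, mu, num_states):
    """Tabular marginalized importance sampling (illustrative sketch).

    states:  (n, H+1) int array, states[i, t] is the state at step t of episode i
    actions: (n, H) int array
    rewards: (n, H) float array
    pi, mu:  (H, S, A) action probabilities of the target / logging policy
    """
    n, H = actions.shape
    # The initial state distribution does not depend on the policy.
    d = np.bincount(states[:, 0], minlength=num_states) / n
    value = 0.0
    for t in range(H):
        s, a, s_next = states[:, t], actions[:, t], states[:, t + 1]
        rho = pi[t, s, a] / mu[t, s, a]            # one-step importance ratio
        counts = np.bincount(s, minlength=num_states)
        visited = counts > 0                        # unvisited states contribute 0
        # Importance-weighted mean reward for each state at step t.
        r_sum = np.bincount(s, weights=rho * rewards[:, t], minlength=num_states)
        r_hat = np.zeros(num_states)
        r_hat[visited] = r_sum[visited] / counts[visited]
        # Importance-weighted estimate of the state transition under pi.
        P = np.zeros((num_states, num_states))
        np.add.at(P, (s, s_next), rho)
        P[visited] /= counts[visited, None]
        value += d @ r_hat                          # expected reward at step t
        d = d @ P                                   # marginal at step t + 1
    return value

# Toy data from a hypothetical random nonstationary MDP (assumption, not from the paper).
rng = np.random.default_rng(0)
S, A, H, n = 3, 2, 4, 500
mu = rng.dirichlet(np.ones(A), size=(H, S))         # logging policy, shape (H, S, A)
pi = rng.dirichlet(np.ones(A), size=(H, S))         # target policy, shape (H, S, A)
trans = rng.dirichlet(np.ones(S), size=(H, S, A))   # transition kernel, shape (H, S, A, S)
R = rng.random((H, S, A))                           # deterministic reward table

states = np.zeros((n, H + 1), dtype=int)
actions = np.zeros((n, H), dtype=int)
rewards = np.zeros((n, H))
states[:, 0] = rng.integers(S, size=n)
for i in range(n):
    for t in range(H):
        s_t = states[i, t]
        a_t = rng.choice(A, p=mu[t, s_t])
        actions[i, t] = a_t
        rewards[i, t] = R[t, s_t, a_t]
        states[i, t + 1] = rng.choice(S, p=trans[t, s_t, a_t])

off_policy = mis_estimate(states, actions, rewards, pi, mu, S)
on_policy = mis_estimate(states, actions, rewards, mu, mu, S)
print("MIS estimate of the target policy value:", off_policy)
print("MIS estimate with pi = mu:", on_policy)
```

As a sanity check on the sketch, when the target policy equals the logging policy every importance ratio is 1, and the estimate reduces to the average observed return. Each step uses only a one-step ratio, which is how this style of estimator avoids multiplying H ratios together as in trajectory-level importance sampling.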
