
Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

2020-02-26

Abstract

Motivated by the many real-world applications of reinforcement learning (RL) that require safe policy iterations, we consider the problem of off-policy evaluation (OPE), that is, evaluating a new policy using historical data obtained from different behavior policies, under the model of nonstationary episodic Markov Decision Processes (MDPs) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from a variance that grows exponentially with the RL horizon H. To address this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution under the target policy at every step. MIS achieves a mean-squared error of

$$\frac{1}{n}\sum_{t=1}^{H}\mathbb{E}_{\mu}\!\left[\frac{d_t^{\pi}(s_t)^2}{d_t^{\mu}(s_t)^2}\,\mathrm{Var}_{\mu}\!\left[\frac{\pi_t(a_t\mid s_t)}{\mu_t(a_t\mid s_t)}\bigl(V_{t+1}^{\pi}(s_{t+1})+r_t\bigr)\,\middle|\,s_t\right]\right]+\tilde{O}(n^{-1.5}),$$

where µ and π are the logging and target policies, $d_t^{\mu}(s_t)$ and $d_t^{\pi}(s_t)$ are the marginal distributions of the state at the t-th step, H is the horizon, n is the sample size, and $V_{t+1}^{\pi}$ is the value function of the MDP under π. The result matches the Cramér-Rao lower bound of Jiang and Li [2016] up to a multiplicative factor of H. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on H. Beyond the theory, we demonstrate the empirical superiority of our method in time-varying, partially observable, and long-horizon RL environments.
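The MIS estimator described in the abstract admits a compact tabular implementation: estimate the state marginals of the target policy recursively through an importance-weighted transition model, then combine them with importance-weighted per-state rewards at each step. The snippet below is a minimal sketch under simplifying assumptions (tabular states, a known behavior policy µ, and equal-length episodes); names such as `mis_ope` are illustrative and not taken from the paper's code.

```python
import numpy as np

def mis_ope(trajectories, pi, mu, n_states, horizon):
    """Sketch of a marginalized importance sampling (MIS) OPE estimator
    for a tabular, nonstationary episodic MDP.

    trajectories: list of episodes collected under mu, each a list of
                  (s, a, r, s_next) tuples for t = 0..horizon-1.
    pi, mu:       arrays of shape (horizon, n_states, n_actions) giving
                  pi_t(a|s) and mu_t(a|s).
    Returns an estimate of the target policy's expected total reward.
    """
    n = len(trajectories)
    # d_pi[t, s] approximates the state marginal of pi at step t,
    # propagated recursively via an importance-weighted transition model.
    d_pi = np.zeros((horizon, n_states))
    # At t = 0 the marginal equals the policy-independent initial distribution.
    for traj in trajectories:
        d_pi[0, traj[0][0]] += 1.0 / n

    value = 0.0
    for t in range(horizon):
        trans = np.zeros((n_states, n_states))  # importance-weighted s -> s'
        rew = np.zeros(n_states)                # importance-weighted reward at s
        visits = np.zeros(n_states)             # visits to s under mu at step t
        for traj in trajectories:
            s, a, r, s_next = traj[t]
            w = pi[t, s, a] / mu[t, s, a]       # per-step importance ratio
            visits[s] += 1.0
            trans[s, s_next] += w
            rew[s] += w * r
        nonzero = visits > 0
        trans[nonzero] /= visits[nonzero, None]  # estimated P_t^pi(s'|s)
        rew[nonzero] /= visits[nonzero]          # estimated r_t^pi(s)

        value += d_pi[t] @ rew                   # step-t contribution to the value
        if t + 1 < horizon:
            d_pi[t + 1] = d_pi[t] @ trans        # recursive marginal update
    return value
```

Note that, unlike trajectory-wise IS, the per-step ratios here are never multiplied across the horizon; the marginals carry the policy mismatch forward instead, which is what keeps the variance polynomial rather than exponential in H.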

