We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}(\sqrt{HSAT} + H^2 S^2 A + H\sqrt{T})$, where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the number of time-steps. This result improves over the best previous known bound $\tilde{O}(HS\sqrt{AT})$ achieved by the UCRL2 algorithm of Jaksch et al. (2010). The key significance of our new results is that when $T \geq H^3 S^3 A$ and $SA \geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in $S$), and we define Bernstein-based "exploration bonuses" that use the empirical variance of the estimated values at the next states (to improve scaling in $H$).
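To make the algorithmic idea concrete, the following is a minimal NumPy sketch of one optimistic planning pass: backward value iteration where the bonus is built from the empirical variance of the estimated next-state values, in the Bernstein style described above. The function name `ucbvi_bernstein_sketch`, the log factor `L`, the constants, and the lower-order $1/N$ term are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def ucbvi_bernstein_sketch(P_hat, R_hat, N, H, delta=0.05):
    """One optimistic backward value-iteration pass (illustrative sketch).

    P_hat : (S, A, S) empirical transition probabilities
    R_hat : (S, A)    empirical mean rewards, assumed in [0, 1]
    N     : (S, A)    state-action visit counts
    Returns optimistic Q of shape (H, S, A) and V of shape (H+1, S).
    """
    S, A, _ = P_hat.shape
    Nc = np.maximum(N, 1)                             # avoid division by zero
    L = np.log(4.0 * S * A * H * Nc.max() / delta)    # illustrative log factor
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        EV = P_hat @ V[h + 1]                         # empirical E[V_{h+1}(s')], shape (S, A)
        var = P_hat @ V[h + 1] ** 2 - EV ** 2         # empirical variance of V_{h+1}(s')
        var = np.maximum(var, 0.0)                    # guard tiny negative values
        # Bernstein-style bonus: variance-dependent term plus a lower-order 1/N term
        bonus = np.sqrt(2.0 * L * var / Nc) + 7.0 * H * L / (3.0 * Nc)
        Q[h] = np.minimum(R_hat + EV + bonus, H - h)  # clip at maximum remaining return
        V[h] = Q[h].max(axis=1)
    return Q, V
```

In the full episodic algorithm, a pass of this form would be re-run after each episode with updated counts and empirical models, and actions taken greedily with respect to the optimistic $Q$; the key point the sketch illustrates is that the bonus scales with the variance of the estimated next-state values rather than with a worst-case range of $H$.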