
Minimax Regret Bounds for Reinforcement Learning

2020-03-09

Abstract

We consider the problem of provably optimal exploration in reinforcement learning for finite-horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}(\sqrt{HSAT} + H^2 S^2 A + H\sqrt{T})$ where H is the time horizon, S the number of states, A the number of actions and T the number of time-steps. This result improves over the best previously known bound $\tilde{O}(HS\sqrt{AT})$ achieved by the UCRL2 algorithm of Jaksch et al. (2010). The key significance of our new results is that when $T \geq H^3 S^3 A$ and $SA \geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in S), and we define Bernstein-based "exploration bonuses" that use the empirical variance of the estimated values at the next states (to improve scaling in H).
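The abstract describes optimistic value iteration with Bernstein-based exploration bonuses computed from the empirical variance of the estimated next-state values. Below is a minimal sketch of that idea for a tabular finite-horizon MDP; it is not the paper's exact algorithm. The bonus constants, the log term, and the interface (array names such as `counts`, `trans_counts`, `rewards`) are assumptions made for illustration.

```python
# Minimal sketch of optimistic value iteration with a variance-based bonus,
# in the spirit of the abstract. NOT the paper's exact algorithm: constants,
# the log term, and the data layout are simplified assumptions.
import numpy as np

def optimistic_value_iteration(counts, trans_counts, rewards, H, delta=0.05):
    """One planning pass on the empirical model plus an exploration bonus.

    counts[s, a]           -- visit counts N(s, a)
    trans_counts[s, a, s'] -- transition counts N(s, a, s')
    rewards[s, a]          -- mean rewards in [0, 1] (assumed known here)
    Returns optimistic Q-values of shape (H, S, A).
    """
    S, A = counts.shape
    L = np.log(5 * S * A * max(counts.sum(), 1) / delta)  # log factor (assumed form)
    n = np.maximum(counts, 1)
    p_hat = trans_counts / n[:, :, None]                  # empirical transition model

    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))                              # V[H] = 0 at the horizon
    for h in reversed(range(H)):
        ev = p_hat @ V[h + 1]                             # empirical mean of V_{h+1}
        var = p_hat @ (V[h + 1] ** 2) - ev ** 2           # empirical variance of V_{h+1}
        # Bernstein-style bonus: variance term plus a lower-order 1/n correction.
        bonus = np.sqrt(2 * L * np.maximum(var, 0) / n) + 7 * H * L / (3 * n)
        Q[h] = np.minimum(rewards + ev + bonus, H)        # optimism, clipped at H
        V[h] = Q[h].max(axis=1)
    return Q
```

In an episodic loop one would act greedily with respect to these optimistic Q-values, update the counts from the observed transitions, and replan before each episode; the variance-based bonus is what the abstract credits for the improved scaling in H.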

