
Optimistic posterior sampling for reinforcement learning: worst-case regret bounds

2020-02-10

Abstract 

We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high-probability regret upper bound of $\tilde{O}(D\sqrt{SAT})$ for any communicating MDP with $S$ states, $A$ actions and diameter $D$, when $T \ge S^5 A$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy, in time horizon $T$. This result improves over the best previously known upper bound of $\tilde{O}(DS\sqrt{AT})$ achieved by any algorithm in this setting, and matches the dependence on $S$ in the established lower bound of $\Omega(\sqrt{DSAT})$ for this problem. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
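The idea behind posterior sampling for reinforcement learning is to maintain a Dirichlet posterior over the transition probabilities of each state-action pair, periodically draw a model from that posterior, and act greedily with respect to the sampled model. Below is a minimal, hypothetical sketch of this loop for a small tabular MDP. It is not the paper's algorithm: it uses a single posterior sample per epoch, a simple visit-count doubling schedule, known deterministic rewards, and discounted value iteration as a stand-in for undiscounted average-reward planning, whereas the paper's optimistic variant draws multiple samples and plans in an extended MDP.

```python
# Hypothetical sketch of posterior (Thompson) sampling for a tabular MDP.
# Assumptions not from the paper: known deterministic rewards R[s, a], a
# discounted value-iteration planner, and a single posterior sample per epoch.
import numpy as np


def greedy_policy(P, R, gamma=0.99, n_iters=500):
    """Value iteration on the sampled model; returns a greedy deterministic policy."""
    S, A, _ = P.shape
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        Q = R + gamma * (P @ V)      # (S, A): expected one-step backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1)


def posterior_sampling_rl(true_P, R, T, seed=0):
    """Run posterior sampling for T steps in an MDP with transition kernel true_P."""
    rng = np.random.default_rng(seed)
    S, A, _ = true_P.shape
    counts = np.ones((S, A, S))      # Dirichlet(1, ..., 1) prior over P(. | s, a)
    visits = np.zeros((S, A))        # visit counts at the start of the current epoch
    s, t, total_reward = 0, 0, 0.0
    while t < T:
        # Draw one transition model from the Dirichlet posterior, row by row.
        P_sample = np.apply_along_axis(rng.dirichlet, 2, counts)
        policy = greedy_policy(P_sample, R)
        epoch_visits = np.zeros((S, A))
        while t < T:
            a = policy[s]
            total_reward += R[s, a]
            s_next = rng.choice(S, p=true_P[s, a])
            counts[s, a, s_next] += 1   # posterior update for (s, a)
            epoch_visits[s, a] += 1
            t += 1
            # End the epoch once some (s, a) doubles its previous visit count.
            stop = epoch_visits[s, a] >= max(1.0, visits[s, a])
            s = s_next
            if stop:
                break
        visits += epoch_visits
    return total_reward
```

As a usage example, one can build a small random transition kernel `true_P` and reward table `R` and call `posterior_sampling_rl(true_P, R, T=10_000)`. The property that drives the paper's worst-case analysis is that the Dirichlet posterior places enough mass on optimistic models, which is where the anti-concentration results mentioned in the abstract enter.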


