
Optimistic posterior sampling for reinforcement learning: worst-case regret bounds

2020-02-10

Abstract 

We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high-probability regret upper bound of $\tilde{O}(D\sqrt{SAT})$ for any communicating MDP with $S$ states, $A$ actions and diameter $D$, when $T \ge S^5 A$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy, in time horizon $T$. This result improves over the best previously known upper bound of $\tilde{O}(DS\sqrt{AT})$ achieved by any algorithm in this setting, and matches the dependence on $S$ in the established lower bound of $\Omega(\sqrt{DSAT})$ for this problem. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
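The idea behind posterior sampling for reinforcement learning is to maintain a Dirichlet posterior over the transition probabilities of each state-action pair, periodically draw a model from that posterior, and act greedily with respect to the sampled model. Below is a minimal, hypothetical sketch of this loop for a small tabular MDP. It is not the paper's algorithm: it uses a single posterior sample per epoch, a simple visit-count doubling schedule, known deterministic rewards, and discounted value iteration as a stand-in for undiscounted average-reward planning, whereas the paper's optimistic variant draws multiple samples and plans in an extended MDP.

```python
# Hypothetical sketch of posterior (Thompson) sampling for a tabular MDP.
# Assumptions not from the paper: known deterministic rewards R[s, a], a
# discounted value-iteration planner, and a single posterior sample per epoch.
import numpy as np


def greedy_policy(P, R, gamma=0.99, n_iters=500):
    """Value iteration on the sampled model; returns a greedy deterministic policy."""
    S, A, _ = P.shape
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        Q = R + gamma * (P @ V)      # (S, A): expected one-step backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1)


def posterior_sampling_rl(true_P, R, T, seed=0):
    """Run posterior sampling for T steps in an MDP with transition kernel true_P."""
    rng = np.random.default_rng(seed)
    S, A, _ = true_P.shape
    counts = np.ones((S, A, S))      # Dirichlet(1, ..., 1) prior over P(. | s, a)
    visits = np.zeros((S, A))        # visit counts at the start of the current epoch
    s, t, total_reward = 0, 0, 0.0
    while t < T:
        # Draw one transition model from the Dirichlet posterior, row by row.
        P_sample = np.apply_along_axis(rng.dirichlet, 2, counts)
        policy = greedy_policy(P_sample, R)
        epoch_visits = np.zeros((S, A))
        while t < T:
            a = policy[s]
            total_reward += R[s, a]
            s_next = rng.choice(S, p=true_P[s, a])
            counts[s, a, s_next] += 1   # posterior update for (s, a)
            epoch_visits[s, a] += 1
            t += 1
            # End the epoch once some (s, a) doubles its previous visit count.
            stop = epoch_visits[s, a] >= max(1.0, visits[s, a])
            s = s_next
            if stop:
                break
        visits += epoch_visits
    return total_reward
```

As a usage example, one can build a small random transition kernel `true_P` and reward table `R` and call `posterior_sampling_rl(true_P, R, T=10_000)`. The property that drives the paper's worst-case analysis is that the Dirichlet posterior places enough mass on optimistic models, which is where the anti-concentration results mentioned in the abstract enter.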


