Abstract
The exploration bonus is an effective approach to manage the exploration-exploitation trade-off in Markov Decision Processes (MDPs). While it has been analyzed in infinite-horizon discounted and finite-horizon problems, we focus on designing and analyzing the exploration bonus in the more challenging infinite-horizon undiscounted setting. We first introduce SCAL+, a variant of SCAL [1], that uses a suitable exploration bonus to solve any discrete unknown weakly-communicating MDP for which an upper bound c on the span of the optimal bias function is known. We prove that SCAL+ enjoys the same regret guarantees as SCAL, which instead relies on the less efficient extended value iteration approach. Furthermore, we leverage the flexibility provided by the exploration bonus scheme to generalize SCAL+ to smooth MDPs with continuous state space and discrete actions. We show that the resulting algorithm (SCCAL+) achieves the same regret bound as UCCRL [2] while being the first implementable algorithm for this setting.