Regret Bounds for Thompson Sampling in Episodic Restless Bandit Problems

2020-02-19

Abstract

Restless bandit problems are instances of non-stationary multi-armed bandits. These problems have been well studied from the optimization perspective, where the goal is to efficiently find a near-optimal policy when the system parameters are known. However, very few papers adopt a learning perspective, in which the parameters are unknown. In this paper, we analyze the performance of Thompson sampling in episodic restless bandits with unknown parameters. We consider a general policy map to define our competitor and prove an $\tilde{O}(\sqrt{T})$ Bayesian regret bound. Our competitor is flexible enough to represent various benchmarks, including the best fixed action policy, the optimal policy, the Whittle index policy, and the myopic policy. We also present empirical results that support our theoretical findings.
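To make the episodic setting concrete, the following is a minimal sketch (not the paper's implementation) of Thompson sampling on a toy restless bandit: each arm is a two-state Markov chain whose state evolves at every step whether or not the arm is pulled, and pulling an arm pays its state (1 or 0) as reward and reveals that state. At the start of each episode the learner samples transition probabilities from a Beta posterior and plays a fixed benchmark policy (here the myopic one) computed from that sample; all sizes, names, and the myopic choice are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: sizes and parameters are illustrative assumptions.
N_ARMS, HORIZON, EPISODES = 3, 20, 300
# TRUE_P[i, s] = P(arm i moves to state 1 | current state s); unknown to the learner.
TRUE_P = rng.uniform(0.2, 0.8, size=(N_ARMS, 2))

def run_episode(p_hat):
    """One episode under the myopic policy induced by sampled parameters p_hat."""
    state = rng.integers(0, 2, size=N_ARMS)   # hidden true states
    belief = np.full(N_ARMS, 0.5)             # learner's P(state = 1) under p_hat
    reward, obs = 0.0, []
    for _ in range(HORIZON):
        arm = int(np.argmax(belief))          # myopic benchmark: most promising arm
        s = int(state[arm])
        reward += s
        # Every arm transitions ("restless"), pulled or not.
        nxt = (rng.random(N_ARMS) < TRUE_P[np.arange(N_ARMS), state]).astype(int)
        obs.append((arm, s, int(nxt[arm])))   # only the pulled arm's transition is seen
        # Belief propagation under p_hat; the observed arm restarts from its known state.
        belief = belief * p_hat[:, 1] + (1.0 - belief) * p_hat[:, 0]
        belief[arm] = p_hat[arm, s]
        state = nxt
    return reward, obs

# Beta(1, 1) prior on every unknown transition probability.
alpha = np.ones((N_ARMS, 2))
beta = np.ones((N_ARMS, 2))
rewards = []
for _ in range(EPISODES):
    p_hat = rng.beta(alpha, beta)             # posterior sample at episode start
    r, obs = run_episode(p_hat)
    rewards.append(r)
    for arm, s, nxt in obs:                   # conjugate Beta-Bernoulli update
        alpha[arm, s] += nxt
        beta[arm, s] += 1 - nxt

print(np.mean(rewards[:50]), np.mean(rewards[-50:]))
```

The episode boundary is what makes the posterior update clean: parameters are resampled only between episodes, so within an episode the learner follows a single, fixed policy from the policy map, matching the structure analyzed in the paper.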
