
Thompson Sampling for Combinatorial Semi-Bandits

2020-03-16

Abstract

We study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We analyze the standard TS algorithm for the general CMAB and obtain the first distribution-dependent regret bound of O(m log T / Δmin) for TS under general CMAB, where m is the number of arms, T is the time horizon, and Δmin is the minimum gap between the expected reward of the optimal solution and that of any non-optimal solution. We also show that one cannot use an approximate oracle in the TS algorithm even for MAB problems. We then extend the analysis to matroid bandits, a special case of CMAB for which we can remove the independence assumption across arms and achieve a better regret bound. Finally, we present experiments comparing the regret of the CUCB and CTS algorithms.
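To make the CTS procedure referenced in the abstract concrete, below is a minimal sketch of combinatorial Thompson sampling under Bernoulli semi-bandit feedback, assuming an exact offline oracle that maps a vector of sampled means to the best feasible super-arm. The oracle, the Bernoulli outcome model, and all function names (`cts`, `oracle`, `draw_outcomes`) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cts(num_arms, oracle, draw_outcomes, horizon, rng=None):
    """Sketch of combinatorial Thompson sampling (CTS) with Beta posteriors.

    oracle(theta)    -> list of arm indices (the chosen super-arm)
    draw_outcomes(S) -> dict {arm: 0/1 outcome} for each arm in S (semi-bandit feedback)
    """
    rng = rng or np.random.default_rng()
    alpha = np.ones(num_arms)  # Beta prior successes per arm
    beta = np.ones(num_arms)   # Beta prior failures per arm
    history = []
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)      # one posterior sample per arm
        chosen = oracle(theta)             # exact offline optimization on sampled means
        outcomes = draw_outcomes(chosen)   # observe feedback for every played arm
        for arm, x in outcomes.items():
            alpha[arm] += x
            beta[arm] += 1 - x
        history.append(chosen)
    return history

# Usage example with top-k selection as the combinatorial structure (an assumed instance).
if __name__ == "__main__":
    true_means = np.array([0.9, 0.8, 0.3, 0.2, 0.1])
    k = 2
    rng = np.random.default_rng(0)
    top_k_oracle = lambda theta: list(np.argsort(theta)[-k:])
    feedback = lambda S: {a: int(rng.random() < true_means[a]) for a in S}
    plays = cts(len(true_means), top_k_oracle, feedback, horizon=2000, rng=rng)
    print("last played super-arm:", sorted(plays[-1]))
```

Note that the sketch assumes an exact oracle; as the abstract points out, substituting an approximate oracle breaks the TS analysis even in the plain MAB setting.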

