Combinatorial Multi-Armed Bandit with General Reward Functions


Abstract

In this paper, we study the stochastic combinatorial multi-armed bandit (CMAB) framework that allows a general nonlinear reward function, whose expected value may not depend only on the means of the input random variables but possibly on the entire distributions of these variables. Our framework enables a much larger class of reward functions, such as the max() function and nonlinear utility functions. Existing techniques relying on accurate estimations of the means of random variables, such as the upper confidence bound (UCB) technique, do not work directly on these functions. We propose a new algorithm called stochastically dominant confidence bound (SDCB), which estimates the distributions of underlying random variables and their stochastically dominant confidence bounds. We prove that SDCB can achieve O(log T) distribution-dependent regret and Õ(√T) distribution-independent regret, where T is the time horizon. We apply our results to the K-MAX problem and expected utility maximization problems. In particular, for K-MAX, we provide the first polynomial-time approximation scheme (PTAS) for its offline problem, and give the first Õ(√T) bound on the (1−ε)-approximation regret of its online problem, for any ε > 0.
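The key construction behind SDCB can be illustrated with a small sketch: maintain the empirical CDF of an arm's observed outcomes and shift it downward by a confidence radius (a DKW-style deviation bound). The shifted CDF corresponds to a distribution that first-order stochastically dominates the empirical one, i.e. it optimistically places more mass on larger values, which is the distributional analogue of a UCB on a mean. The function name and the exact radius below are illustrative assumptions, not the paper's precise constants.

```python
import math

def dominant_cdf(samples, t):
    """Illustrative sketch of a stochastically dominant confidence bound.

    Computes the empirical CDF of the observed samples, then shifts it
    down by a DKW-style confidence radius (clipped at 0). The shifted
    CDF lies pointwise below the empirical CDF, so the distribution it
    describes first-order stochastically dominates the empirical one.
    """
    n = len(samples)
    ordered = sorted(samples)
    xs = sorted(set(samples))  # support points of the empirical distribution
    # Empirical CDF evaluated at each support point.
    ecdf = [sum(1 for s in ordered if s <= x) / n for x in xs]
    # Illustrative DKW-style confidence radius at round t.
    radius = math.sqrt(math.log(max(t, 2)) / (2 * n))
    # Optimistic (stochastically dominant) CDF: shift down, clip at 0.
    lower = [max(f - radius, 0.0) for f in ecdf]
    return xs, ecdf, lower
```

For example, with samples [0.2, 0.5, 0.5, 0.9] after t = 100 rounds, the returned `lower` curve sits below the empirical CDF at every support point; an SDCB-style algorithm would then feed the optimistic distribution into the offline oracle instead of a single optimistic mean.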
