Thompson Sampling for Complex Online Problems

2020-03-04


Abstract

We consider stochastic multi-armed bandit problems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a basic arm in each round. The reward of the complex action is some function of the basic arms’ rewards, and the feedback observed may not necessarily be the reward per arm. For instance, when the complex actions are subsets of the arms, we may only observe the maximum reward over the chosen subset. Thus, feedback across complex actions may be coupled due to the nature of the reward function. We prove a frequentist regret bound for Thompson sampling in a very general setting involving parameter, action and observation spaces and a likelihood function over them. The bound holds for discretely supported priors over the parameter space without additional structural properties such as closed-form posteriors, conjugate prior structure or independence across arms. The regret bound scales logarithmically with time but, more importantly, with an improved constant that non-trivially captures the coupling across complex actions due to the structure of the rewards. As applications, we derive improved regret bounds for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear MAX reward feedback from subsets. Using particle filters for computing posterior distributions which lack an explicit closed form, we present numerical results for the performance of Thompson sampling for subset-selection and job scheduling problems.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).
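To make the setting concrete, the following is a minimal sketch of Thompson sampling with a discretely supported prior when the only feedback is the maximum reward over the chosen subset. It is an illustration, not the authors' implementation. It assumes independent Bernoulli arms, a uniform prior over a small grid of candidate parameter vectors that contains the true parameter, and exact posterior updates over that finite grid in place of the particle filters mentioned in the abstract. All function and variable names (`max_feedback_prob`, `thompson_max_subsets`, `candidates`) are hypothetical.

```python
import itertools
import random


def max_feedback_prob(theta, action):
    """P(max reward over the subset equals 1), assuming independent Bernoulli arms."""
    miss = 1.0
    for arm in action:
        miss *= 1.0 - theta[arm]
    return 1.0 - miss


def thompson_max_subsets(true_theta, candidates, subset_size, horizon, seed=0):
    """Thompson sampling over subsets of arms with MAX-only feedback.

    `candidates` is the finite support of the prior (assumed uniform here).
    Returns the final posterior weights and the play count of each subset.
    """
    rng = random.Random(seed)
    n_arms = len(true_theta)
    actions = list(itertools.combinations(range(n_arms), subset_size))
    weights = [1.0 / len(candidates)] * len(candidates)
    plays = {a: 0 for a in actions}

    for _ in range(horizon):
        # 1. Sample one parameter vector from the current posterior.
        theta = rng.choices(candidates, weights=weights, k=1)[0]
        # 2. Play the subset that is optimal for the sampled parameter.
        action = max(actions, key=lambda a: max_feedback_prob(theta, a))
        plays[action] += 1
        # 3. Draw per-arm rewards, but reveal only their maximum.
        rewards = [int(rng.random() < true_theta[arm]) for arm in action]
        y = max(rewards)
        # 4. Bayes update: reweight each candidate by the likelihood of y.
        for i, cand in enumerate(candidates):
            p = max_feedback_prob(cand, action)
            weights[i] *= p if y == 1 else 1.0 - p
        total = sum(weights)
        weights = [w / total for w in weights]

    return weights, plays


# Illustrative run: 4 arms, subsets of size 2, prior on a 3-value grid per arm.
grid = (0.2, 0.5, 0.8)
candidates = list(itertools.product(grid, repeat=4))
true_theta = (0.2, 0.8, 0.5, 0.2)  # assumed to lie in the prior's support
weights, plays = thompson_max_subsets(true_theta, candidates, subset_size=2, horizon=500)
```

Because one MAX observation is informative about every candidate parameter at once, the single update in step 4 couples information across subsets, which is the kind of structure the regret constant in the abstract is said to capture. With MAX-only feedback the individual arm means need not be identifiable, so the posterior may stay spread over several candidates that agree on the best subset.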
