Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards

资源分类

2020-01-16 |

66 |

40 |

Abstract

We consider an infinite-armed bandit problem with Bernoulli rewards. The mean rewards are independent, uniformly distributed over [0, 1]. Rewards 0 and 1 are referred to as a success and a failure, respectively. We propose a novel algorithm where the decision to exploit any arm is based on two successive targets, namely, the total number of successes until the first failure and until the first m failures, respectively, where m is a fixed ? parameter. This two-target algorithm achieves a long-term average regret in 图片.png for a large parameter m and a known time horizon n. This regret is optimal and?strictly less than the regret achieved by the best known algorithms, which is in The results are extended to any mean-reward distribution whose support contains 1 and to unknown time horizons. Numerical experiments show the performance of the algorithm for finite time horizons.

上一篇：More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server

下一篇：Buy-in-Bulk Active Learning

用户评价

全部评价

还没有评论，说两句吧！

热门资源

Learning to Predi...

Much of model-based reinforcement learning invo...
Stratified Strate...

In this paper we introduce Stratified Strategy ...
The Variational S...

Unlike traditional images which do not offer in...
A Mathematical Mo...

Direct democracy, where each voter casts one vo...
Rating-Boosted La...

The performance of a recommendation system reli...

智能在线

400-630-6780
聆听.建议反馈

E-mail: support@tusaishared.com