资源论文Online EXP3 Learning in Adversarial Bandits with Delayed Feedback

Online EXP3 Learning in Adversarial Bandits with Delayed Feedback

2020-02-19 | |  55 |   43 |   0

Abstract

Consider a player that in each of T rounds chooses one of K arms. An adversary chooses the cost of each arm in a bounded interval, and a sequence of feedback delays {图片.png } that are unknown to the player. After picking arm at at round t, the player receives the cost of playing this arm 图片.png rounds later. In cases where t + 图片.png > T , this feedback is simply missing. We prove that the EXP3 algorithm r (that uses the delayedfeedback upon its arPT rival) achieves a regret of 图片.png . For the case where 图片.png dt and T are unknown, we propose a novel doubling trick for online learning r with delays and prove that this adaptive EXP3 achieves a regret of  图片.png . We then consider a two player zero-sum game where players experience asynchronous delays. We show that even when the delays are large enough such that players no longer enjoy the “no-regret property”, (e.g., where 图片.png= 图片.png the ergodic average of the strategy profile still converges to the set of Nash equilibria of the game. The result is made possible by choosing an adaptive step size 图片.png that is not summable but is square summable, and proving a “weighted regret bound” for this general case.

上一篇:Learning Disentangled Representations for Recommendation

下一篇:An Adaptive Empirical Bayesian Method for Sparse Deep Learning

用户评价
全部评价

热门资源

  • Learning to Predi...

    Much of model-based reinforcement learning invo...

  • Stratified Strate...

    In this paper we introduce Stratified Strategy ...

  • The Variational S...

    Unlike traditional images which do not offer in...

  • A Mathematical Mo...

    Direct democracy, where each voter casts one vo...

  • Rating-Boosted La...

    The performance of a recommendation system reli...