Abstract
In the classical contextual bandits problem, in each round $t$, a learner observes some context $c$, chooses some action $a$ to perform, and receives some reward $r_{a,t}(c)$. We consider the variant of this problem where, in addition to receiving the reward $r_{a,t}(c)$, the learner also learns the values of $r_{a,t}(c')$ for all other contexts $c'$; i.e., the rewards that would have been achieved by performing that action under different contexts. This variant arises in several strategic settings, such as learning how to bid in non-truthful repeated auctions, which has gained a lot of attention lately as many platforms have switched to running first-price auctions. We call this problem the contextual bandits problem with cross-learning. The best algorithms for the classical contextual bandits problem achieve $\tilde{O}(\sqrt{CKT})$ regret against all stationary policies, where $C$ is the number of contexts, $K$ the number of actions, and $T$ the number of rounds. We demonstrate algorithms for the contextual bandits problem with cross-learning that remove the dependence on $C$ and achieve regret $\tilde{O}(\sqrt{KT})$. We simulate our algorithms on real auction data from an ad exchange running first-price auctions, showing that they outperform traditional contextual bandit algorithms.