
Thompson Sampling for Multinomial Logit Contextual Bandits

2020-02-19

Abstract

We consider a dynamic assortment selection problem where the goal is to offer a sequence of assortments that maximizes the expected cumulative revenue, or equivalently, minimizes the expected regret. The feedback is the item that the user picks from the offered assortment. The distinguishing feature of this work is that this feedback is governed by a multinomial logit choice model, in which the utility of each item is a dynamic function of contextual information of both the item and the user. We refer to this problem as the multinomial logit contextual bandit, and we propose two Thompson sampling algorithms for it. Our first algorithm maintains a posterior distribution of the true parameter and establishes an Õ(d√T) Bayesian regret over T rounds with a d-dimensional context vector. The second algorithm approximates the posterior by a Gaussian distribution and uses a new optimistic sampling procedure to address the issues that arise in worst-case regret analysis; it achieves an Õ(d^{3/2}√T) worst-case (frequentist) regret bound. Numerical experiments show that the practical performance of both methods is in line with the theoretical guarantees.
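The two ingredients the abstract describes can be sketched in a few lines: the multinomial logit choice probabilities over an offered assortment, and one round of Gaussian-approximate Thompson sampling. This is a minimal illustration, not the authors' implementation: it assumes a linear utility u_i = x_i^T θ, and it replaces the exact assortment optimization with a greedy revenue-weighted top-K surrogate. All function and variable names here are hypothetical.

```python
import numpy as np

def mnl_choice_probs(utilities):
    """Multinomial logit choice probabilities for an offered assortment.

    `utilities` holds the utility of each offered item; the no-purchase
    option has utility 0, so P(no purchase) = 1 / (1 + sum(exp(u))).
    """
    expu = np.exp(utilities)
    return expu / (1.0 + expu.sum())

def ts_select_assortment(contexts, theta_mean, theta_cov, revenues, K, rng):
    """One round of Gaussian-approximate Thompson sampling (sketch).

    Sample a parameter from the Gaussian posterior approximation,
    compute sampled utilities u_i = x_i^T theta, then offer K items.
    The true method optimizes expected revenue over assortments; here
    we use a simple revenue-weighted greedy surrogate for brevity.
    """
    theta = rng.multivariate_normal(theta_mean, theta_cov)
    utilities = contexts @ theta
    scores = revenues * np.exp(utilities)      # heuristic item scores
    return np.argsort(scores)[::-1][:K]        # indices of offered items

# One simulated round: d = 3 features, N = 5 items, assortment size K = 2.
rng = np.random.default_rng(0)
contexts = rng.standard_normal((5, 3))
assortment = ts_select_assortment(
    contexts, np.zeros(3), np.eye(3), np.ones(5), K=2, rng=rng)
probs = mnl_choice_probs(contexts[assortment] @ np.zeros(3))
```

After each round, the observed choice would be used to update the posterior (the first algorithm keeps the exact posterior; the second refits the Gaussian approximation and samples optimistically).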

