Abstract
We introduce a new Thompson sampling-based algorithm for online slate bandits, called marginal posterior sampling, which is characterized by three key ideas. First, it postulates that the slate-level reward is a monotone function of the unobserved marginal rewards of the base actions selected in the slate's slots, but it does not attempt to estimate this function. Second, instead of maintaining a slate-level reward posterior, the algorithm
maintains posterior distributions for the marginal
reward of each slot’s base actions and uses the samples from these marginal posteriors to select the
next slate. Third, marginal posterior sampling optimizes at the slot-level rather than the slate-level,
which makes the approach computationally efficient. Simulation results establish substantial advantages of marginal posterior sampling over alternative Thompson sampling-based approaches that are widely used in the domain of web services.
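The selection mechanism described above (sample every slot's marginal posteriors, then optimize within each slot independently) can be illustrated with a minimal sketch. Everything in it is an illustrative assumption rather than the paper's specification: Beta-Bernoulli marginal posteriors, a fixed slate shape, and a per-slot binary feedback signal used for the update; the abstract does not specify the posterior family or the update rule.

```python
import numpy as np

# Minimal sketch of slot-level Thompson sampling (assumed Beta-Bernoulli
# marginal posteriors; not the paper's exact algorithm).
rng = np.random.default_rng(0)
n_slots, n_actions = 3, 5          # illustrative slate shape

# One Beta(alpha, beta) posterior per (slot, base action) pair.
alpha = np.ones((n_slots, n_actions))
beta = np.ones((n_slots, n_actions))

def select_slate():
    """Sample from every marginal posterior, then optimize per slot."""
    theta = rng.beta(alpha, beta)   # one draw per (slot, action)
    return theta.argmax(axis=1)     # independent argmax in each slot

def update(slate, slot_signals):
    """Update each slot's posterior from an assumed binary per-slot signal."""
    for slot, (action, r) in enumerate(zip(slate, slot_signals)):
        alpha[slot, action] += r
        beta[slot, action] += 1 - r
```

Because each slot is optimized independently from its own sampled marginals, the per-round selection cost grows linearly in the number of slots and base actions rather than in the (combinatorial) number of slates.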