Abstract
We propose an online algorithm for cumulative regret minimization in a stochastic multi-armed bandit. The algorithm adds O(t) i.i.d. pseudo-rewards
to its history in round t and then pulls the arm with
the highest average reward in its perturbed history.
Therefore, we call it perturbed-history exploration
(PHE). The pseudo-rewards are carefully designed
to offset potentially underestimated mean rewards
of arms with high probability. We derive near-optimal gap-dependent and gap-free bounds on the
n-round regret of PHE. The key step in our analysis is a novel argument showing that randomized
Bernoulli rewards lead to optimism. Finally, we
empirically evaluate PHE and show that it is competitive with state-of-the-art baselines.
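To make the procedure concrete, below is a minimal Python sketch of PHE as described in the abstract. The abstract only specifies that O(t) i.i.d. pseudo-rewards are added to the history in round t and that the perturbed-history mean is maximized; the Bernoulli(1/2) pseudo-reward distribution, the per-arm pseudo-reward count a * T_i, the reward oracle `pull`, and the round-robin initialization are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np


def phe(pull, K, n, a=2, seed=None):
    """Sketch of perturbed-history exploration (PHE) for a K-armed bandit.

    `pull(arm)` returns a stochastic reward in [0, 1]; `a` controls how
    many pseudo-rewards are mixed in per observed reward (assumed here).
    """
    rng = np.random.default_rng(seed)
    T = np.zeros(K, dtype=int)   # T_i: number of pulls of arm i
    V = np.zeros(K)              # V_i: cumulative observed reward of arm i
    for t in range(n):
        if t < K:
            arm = t  # assumed: pull each arm once before perturbing
        else:
            # add a * T_i i.i.d. Bernoulli(1/2) pseudo-rewards to the
            # history of arm i, i.e., O(t) pseudo-rewards in round t
            U = rng.binomial(a * T, 0.5)
            # pull the arm with the highest average reward over the
            # T_i real and a * T_i pseudo observations
            arm = int(np.argmax((V + U) / ((a + 1) * T)))
        r = pull(arm)
        T[arm] += 1
        V[arm] += r
    return T, V


# Usage on a hypothetical 3-armed Bernoulli bandit:
rng = np.random.default_rng(0)
means = [0.3, 0.5, 0.7]
T, V = phe(lambda i: float(rng.random() < means[i]), K=3, n=1000, seed=0)
```

The pseudo-rewards play the role of optimism here: occasionally the Binomial(a * T_i, 1/2) perturbation pushes an underestimated arm's perturbed mean above the incumbent's, so the arm is re-pulled without any explicit confidence bound.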