Abstract
We provide a new analysis framework for the adversarial multi-armed bandit problem. Using the notion of convex smoothing, we define a novel family of algorithms with minimax-optimal regret guarantees. First, we show that regularization via the Tsallis entropy, which includes EXP3 as a special case, matches the Θ(√(TN)) minimax regret (over T rounds with N arms) with a smaller constant factor. Second, we show that a wide class of perturbation methods achieves near-optimal regret, as low as O(√(TN log N)), as long as the perturbation distribution has a bounded hazard function. For example, the Gumbel, Weibull, Fréchet, Pareto, and Gamma distributions all satisfy this key property and lead to near-optimal algorithms.
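As an illustration of the perturbation methods described above, the following is a minimal sketch (not the paper's exact algorithm) of a Follow-the-Perturbed-Leader-style bandit learner using Gumbel noise, which has the bounded hazard rate the abstract identifies as the key property. The function name, the learning-rate tuning, and the Monte Carlo probability estimate are all illustrative assumptions.

```python
import numpy as np

def ftpl_gumbel_bandit(loss_matrix, rng=None):
    """Sketch of a perturbation-based bandit algorithm with Gumbel noise.

    loss_matrix: (T, N) array of losses in [0, 1]; under bandit feedback
    only the pulled arm's loss is revealed each round.
    Returns the pulled arms and the final importance-weighted loss estimates.
    """
    rng = np.random.default_rng(rng)
    T, N = loss_matrix.shape
    eta = np.sqrt(np.log(N) / (T * N))  # assumed learning-rate tuning
    est_loss = np.zeros(N)              # importance-weighted loss estimates
    pulls = np.empty(T, dtype=int)
    for t in range(T):
        # Gumbel perturbations: the bounded-hazard-rate family from the text.
        noise = rng.gumbel(size=N)
        arm = int(np.argmin(eta * est_loss - noise))
        pulls[t] = arm
        # Estimate the pull probability by resampling the perturbation
        # (a Monte Carlo stand-in for the exact smoothed distribution).
        samples = np.argmin(eta * est_loss - rng.gumbel(size=(1000, N)), axis=1)
        p = max((samples == arm).mean(), 1.0 / 1000)
        est_loss[arm] += loss_matrix[t, arm] / p  # inverse-propensity estimate
    return pulls, est_loss
```

With Gumbel noise this perturbation scheme coincides with exponential weights, which is why EXP3 appears as a special case of the framework.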