Abstract
We study the dynamic regret of the multi-armed bandit and experts problems in nonstationary stochastic environments. We introduce a new parameter Λ, which measures the total statistical variance of the loss distributions over T rounds of the process, and study how this quantity affects the regret. We investigate the interaction between Λ and Γ, which counts the number of times the distributions change, as well as between Λ and V, which measures how far the distributions deviate over time. One striking result we find is that even when Λ, V, and Γ are all restricted to constants, the regret lower bound in the bandit setting still grows with T. The other highlight is that in the full-information setting, a constant regret becomes achievable with constant Λ and Γ, as the regret can be made independent of T, while with constant Λ and V, the regret still has a dependency on T. We not only propose algorithms with upper bound guarantees, but prove their matching lower bounds as well.
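To make the three nonstationarity parameters concrete, the sketch below computes one plausible formalization of them from a sequence of per-round loss means and variances: Γ as one plus the number of rounds at which the distributions change, V as the cumulative drift of the means, and Λ as the accumulated variance. The exact definitions in the paper may differ (e.g., in how the maximum over arms is taken); this is an illustrative assumption, not the paper's formal definition.

```python
import numpy as np

def nonstationarity_measures(means, variances):
    """Illustrative (assumed) formalizations of the three parameters.

    Gamma: 1 + number of rounds where the loss distributions change.
    V:     cumulative deviation, sum over t of max_arm |mu_t - mu_{t-1}|.
    Lambda: total statistical variance, sum over t of max_arm var_t.

    means, variances: (T, K) arrays for T rounds and K arms.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    diffs = np.abs(np.diff(means, axis=0))          # (T-1, K) round-to-round shifts
    gamma = 1 + int(np.sum(diffs.max(axis=1) > 0))  # count of change points
    v = float(diffs.max(axis=1).sum())              # total drift of the means
    lam = float(variances.max(axis=1).sum())        # total variance budget
    return gamma, v, lam
```

For example, a two-armed instance whose first arm's mean jumps once (0.5 → 0.7) over three rounds yields Γ = 2, V = 0.2, and Λ equal to the summed per-round variance.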