Abstract
Classic multi-armed bandit algorithms are ineffi-
cient for a large number of arms. On the other
hand, contextual bandit algorithms are more effi-
cient, but they suffer from a large regret due to
the bias of reward estimation with finite dimensional features. Although recent studies proposed
semi-parametric bandits to overcome these defects,
they assume arms’ features are constant over time.
However, this assumption rarely holds in practice,
since real-world problems often involve underlying processes that are dynamically evolving over
time especially for the special promotions like Singles’ Day sales. In this paper, we formulate a novel
Semi-Parametric Contextual Bandit Problem to relax this assumption. For this problem, a novel TwoSteps Upper-Confidence Bound framework, called
Semi-Parametric UCB (SPUCB), is presented. It
can be flexibly applied to linear parametric function problem with a satisfied gap-free bound on the
n-step regret. Moreover, to make our method more
practical in online system, an optimization is proposed for dealing with high dimensional features
of a linear function. Extensive experiments on synthetic data as well as a real dataset from one of the
largest e-commercial platforms demonstrate the superior performance of our algorithm