Abstract
This work investigates the adversarial Bandits with Knapsack (BwK) learning problem, where a player repeatedly chooses an action, pays the corresponding cost, and receives the associated reward. The player is constrained by a budget that limits the total cost that can be spent on the actions, and the rewards and costs of the actions are assigned by an adversary.
This setting is studied in terms of the expected regret, defined as the difference between the total expected reward per unit cost of the best fixed action and the total expected reward per unit cost of the learning algorithm. We propose a novel algorithm, EXP3.BwK, and show that its expected regret is order optimal in the budget.
We then propose another algorithm, EXP3++.BwK, which is order optimal in the adversarial BwK setting and incurs an almost optimal expected regret in the stochastic BwK setting, where the rewards and costs are drawn from unknown underlying distributions. These results are then extended to a more general online learning setting by designing another algorithm, EXP3++.LwK, and providing its performance guarantees. Finally, we investigate the scenario where the costs of the actions are large and comparable to the budget. We show that, in the adversarial setting, the achievable regret bounds of any learning algorithm scale at least linearly with the maximum cost, and are significantly worse than in the case where the costs are bounded by a constant, which is a common assumption in the BwK literature.